<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="height:80px; display: inline"  alt="INSA"/></a> 

# R Tutorial: Detect, Measure, Explain, Mitigate, Indirect Discrimination in Statistical Learning Algorithms

**Short version**

**Abstract**
This tutorial analyzes data extracted from the 1994 US census and available on the [UCI repository](http://archive.ics.uci.edu/ml/). The data relate the level of income (below or above 50k$), analogous to a "solvency" or credit score, to other variables, some of which are sensitive because they indicate membership in a group protected by law: gender, ethnic origin. Different indicators of bias, and therefore of possible indirect discrimination against a group, are defined and illustrated on these data. The main ones, on which the literature agrees, are the disproportionate effect or *disparate / adverse impact* (DI) (*demographic equality*), the conditional error rate (*overall error equality*) and measures associated with the asymmetry of the confusion matrices conditional on the group (*equalized odds*). The tutorial estimates these different biases when predicting creditworthiness by (linear) logistic regression, a binary tree and then a random forest algorithm. The "official" doctrine of *testing* surveys, adapted to detect direct individual discrimination, is also evaluated on the predictions of these algorithms. Finally, an elementary procedure of systemic bias mitigation by *post-processing* is performed in order to evaluate its impact on the prediction accuracy and on the other biases. The objective, in order to meet the expectations of the future [European regulation](https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence) on AI (*AI Act*), is to search, in an explicit and documented way for this AI system, for the least bad trade-off between prediction accuracy, explainability of a decision and control of discriminatory biases.

**Notes**
- The main results of this tutorial were used as an illustration for a presentation at a joint CNIL & Human rights advocate seminar (05/2020); they are explained in a submitted article ([Besse, 2020](https://hal.archives-ouvertes.fr/hal-02616963)) with respect to the obligations listed in the various articles of the *AI Act*.
- This tutorial can be run locally after loading or cloning the repository or in the *Google Colab* cloud by clicking on the link below:



<a href="https://colab.research.google.com/github/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
### The data
Public data available on the [UCI repository](http://archive.ics.uci.edu/ml/) are extracted from the database of the 1994 US census. The two files *train* and *test* have been combined into one. These data are widely used as a *benchmark* to compare the performance of learning methods. The objective is to predict, with more or less bias, the binary variable "annual income" greater or less than 50k$. This prediction has no impact on the persons concerned, but since the approach and the context are quite similar to what a bank could do to evaluate a credit risk, the example is very illustrative. This dataset is systematically used as a sandbox to evaluate the properties of fair learning algorithms because, contrary to many other datasets used for this purpose (*e.g.* *German credit*), the true value of the target variable is known, as well as the ethnicity of the persons concerned.

In the initial data, 48,842 individuals are described by the 14 explanatory variables and the target variable `income` listed in the table below:




|Num|Label|Set of values|
|-|---------:|-------------------:|
|1|`Age`|real|
|2|`workClass`|Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked|
|3|`fnlwgt`|real|
|4|`education`|Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|5|`educNum`|integer|
|6|`mariStat`|Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|7|`occup`|Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces|
|8|`relationship`|Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried|
|9|`origEthn`|White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|10|`sex`|Female, Male|
|11|`capitalGain`|real| 
|12|`capitalLoss`|real|
|13|`hoursWeek`|real|
|14|`nativCountry`|United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands|
|15|`income`|>50K, <=50K|


### Data preparation

The processing starts with a detailed preparation of the data:
- reading and inspecting the data; the variable `fnlwgt` (Final sampling weight) has a [special status](http://web.cs.wpi.edu/~cs4341/C00/Projects/fnlwgt) that is not very clear, so it is removed;
- deletion of observations with missing data, errors or inconsistencies;
- grouping of sparsely populated modalities;
- removal of redundant variables.

### Estimating bias

Of all the existing criteria for bias that might point to indirect discrimination (Zliobaitė, 2015), three are favored here (see Verma and Rubin, 2018):
1. indirect discrimination through disproportionate impact: *disparate impact* or *demographic equality*
2. conditional error rate comparison: *overall error equality*
3. comparison of odds ratios: *conditional procedure accuracy equality* or *disparate mistreatment* or *equalized odds*.
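
The three criteria listed above can be illustrated on simulated data. The sketch below is only an illustration under stated assumptions: the vectors `S`, `Y` and `Yhat` are hypothetical toy data, not the census data, and the group-wise true and false positive rates are one common reading of *equalized odds*; the functions `dispImp`, `overErrEqual` and `oddsEqual` of the repository may proceed differently.

```r
# Toy data (hypothetical): S = sensitive group (0 = protected, 1 = other),
# Y = observed outcome, Yhat = predicted outcome (1 = favorable).
set.seed(1)
n    <- 1000
S    <- rbinom(n, 1, 0.6)
Y    <- rbinom(n, 1, ifelse(S == 1, 0.35, 0.15))
Yhat <- rbinom(n, 1, ifelse(Y == 1, 0.70, 0.10))  # imperfect predictor

# 1. Disparate impact: ratio of favorable prediction rates between groups
diHat <- mean(Yhat[S == 0]) / mean(Yhat[S == 1])

# 2. Overall error equality: error rate within each group
errByGroup <- tapply(Yhat != Y, S, mean)

# 3. Equalized odds: true and false positive rates within each group
tprByGroup <- tapply(Yhat[Y == 1], S[Y == 1], mean)
fprByGroup <- tapply(Yhat[Y == 0], S[Y == 0], mean)

round(c(DI = diHat), 3)
round(rbind(error = errByGroup, TPR = tprByGroup, FPR = fprByGroup), 3)
```

A ratio far from 1 in the first criterion, or clearly different columns in the table of the two others, would point to a bias of the corresponding type.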


The emphasis in this first tutorial is on estimating the *disparate impact* or *adverse impact* of gender. The confidence interval approximation (Besse et al. 2021) is compared with a *bootstrap* estimate leading to the same results. Estimates are computed on the initial database  (societal or systemic bias) and then on the income predictions obtained by two algorithms (logistic regression and random forests) to assess the risk of discrimination. The impact of post-processing attenuation of this bias is evaluated on the accuracy and other types of bias.

**Notes** 
- a [more detailed](https://github.com/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetectionLong.ipynb) but longer notebook proposes a comparison of the impacts of the other learning algorithms and thus of their discriminatory effect as a function of the society bias (gender and ethnic origin). This allows us to better understand the importance of taking into account the interactions between the variables. 
- The site [aif360](https://aif360.mybluemix.net/) also proposes a set of examples and tutorials. It is richer: other datasets, other criteria and especially more debiasing algorithms, but presents either trivial demonstrations or examples of very sophisticated methods of bias mitigation. This tutorial is more pedagogical to understand step by step the problems. 

## Data Exploration
In this phase of the work, there are two radically different views. 
- The one illustrated by Friedler et al. (2019) consists in training an algorithm on the raw data without prior "human" exploration using statistical skills; as a matter of principle, everything is automated.
- The one proposed in this tutorial requires elementary statistical skills to explore the data, understand their structure and detect potential problems (missing data, atypical data, biases, rare classes, "abnormal" distributions...) in order to remedy them as well as possible, and to illustrate the interest of the objective pursued. 

Note that this second point of view of data knowledge is more respectful of the [EC expert guidelines for trustworthy AI](https://ec.europa.eu/futurium/en/ai-alliance-consultation) and anticipates the proposed [European regulation](https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence) (*AI Act*) which imposes the drafting of explicit documentation (Annex IV) on prior studies of data quality and relevance.

### Reading and first transformations
There are two ways to load the data from the UCI repository, depending on the execution mode: locally after installing R, or remotely in the *Google Colab* cloud. 
1. In the first case, the data is downloaded together with the *GitHub* repository.
2. In the second case, <a href="https://colab.research.google.com/github/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
the data and the transformation program are loaded and read in the *Google Colab* environment; the cell below must be run.

In [None]:
# Execute this cell when running in the Google Colab cloud
# otherwise data and functions are already loaded locally
system("wget -P . https://github.com/wikistat/Fair-ML-4-Ethical-AI/raw/master/AdultCensus/adultTrainTest.csv")
system("wget -P . https://github.com/wikistat/Fair-ML-4-Ethical-AI/raw/master/AdultCensus/dataPrepAdultCensus.R")
system("mkdir ../Functions")
system("wget -P ../Functions https://raw.githubusercontent.com/wikistat/Fair-ML-4-Ethical-AI/master/Functions/dispImp.R")
system("wget -P ../Functions https://raw.githubusercontent.com/wikistat/Fair-ML-4-Ethical-AI/master/Functions/overErrEqual.R")
system("wget -P ../Functions https://raw.githubusercontent.com/wikistat/Fair-ML-4-Ethical-AI/master/Functions/oddsEqual.R")

Run the data preparation program, review the code to study its main features.

In [None]:
source("dataPrepAdultCensus.R")
dataBase = dataPrepAdultCensus()
summary(dataBase)

### Basic statistical description
The following cells highlight difficulties present in certain variables or pairs of variables. 

In [None]:
options(repr.plot.width=4, repr.plot.height=4)
hist(dataBase[,"LcapitalGain"],probability=T, main="",xlab="log(1+CapitalGain)")
boxplot(dataBase[,"LcapitalGain"], horizontal=TRUE,boxwex=.2,  outline=TRUE,  
        frame=F, col = "lightgrey", add = TRUE,at=0.2)

The distribution is asymmetric, hence the need to transform certain variables before building linear models. The following tables highlight inconsistencies and strong redundancies between certain variables.

In [None]:
table(dataBase$relationship,dataBase$sex)   

In [None]:
table(dataBase$education,dataBase$educNum)  

In [None]:
table(dataBase$mariStat,dataBase$relationship)

In [None]:
table(dataBase$origEthn,dataBase$nativCountry)

In [None]:
mosaicplot(table(dataBase[,"origEthn"],dataBase[,"income"]),main="", col="lightblue",cex=1.3)

Some modifications are made to the database; variables are deleted so that each piece of sensitive information (gender, ethnic origin) appears only once.
- Deletion of variable 3 `fnlwgt`, which has little meaning for this analysis.
- Creation of a binary variable `child`: presence or absence of children.
- Removal of variable 8 `relationship`, which is redundant with gender and marital status.
- Removal of variable 14 `nativCountry`, which is redundant with ethnic origin.
- Variable 9 `origEthn` is simplified to 2 classes: CaucYes *vs.* CaucNo.

**Note** In what follows, it is important that the levels of the factors are ordered consistently so that the contingency tables and their counts are interpreted correctly. By convention, the modalities presumed to be socially "unfavorable" (low income, female, non-Caucasian; coded 0) precede the others (high income, male, Caucasian; coded 1). It is therefore necessary either to reorder the levels of the `income` variable or to rename the modalities to match the alphabetical order; the second option is chosen here.
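
The effect of the level order can be checked on a small example. The vector `income` below is a hypothetical toy vector, not the census variable; the sketch only shows that R sorts factor levels alphabetically by default, and the two ways of obtaining the "unfavorable then favorable" order.

```r
# Hypothetical toy vector, not taken from the census data
income <- factor(c("low", "high", "high", "low", "low"))
levels(income)            # alphabetical by default: "high" comes first
table(income)

# Option 1: impose the order of the levels explicitly
incomeOrd <- factor(income, levels = c("low", "high"))
levels(incomeOrd)

# Option 2: rename so that the alphabetical order matches the convention
incomeRen <- factor(ifelse(income == "low", "incB", "incH"))
levels(incomeRen)
```

With either option the first row of a contingency table corresponds to the unfavorable modality, which is what the indices used in the rest of the notebook assume.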

In [None]:
dataBase[,"child"]=as.factor(ifelse(dataBase$relationship=='Own-child',"ChildYes","ChildNo"))
dataBase[,"origEthn"]=as.factor(ifelse(dataBase$origEthn %in% c('Amer-Indian','Asian','Black','Other'),"CaucNo","CaucYes"))
dataBase[,"income"]=as.factor(ifelse(dataBase$income=='incLow',"incB","incH"))
datBas=dataBase[,-c(3,8,14)]
summary(datBas)

Some basic *mosaicplots* show the relationship of the sensitive variables to the target (income threshold) and clearly highlight the social bias.

In [None]:
mosaicplot(table(datBas[,"sex"],datBas[,"income"]),main="", col="lightblue",cex=1.3)

In [None]:
mosaicplot(table(datBas[,"origEthn"],datBas[,"income"]),main="",col="lightblue",cex=1.3)

**Q** Comment on the biases present in the database and on the class imbalances.


### Sample preparation
The database is divided into three samples: learning, validation and test. 

In [None]:
summary(datBas)

Selection of the variables used in the following, the qualitative variables obtained from the quantitative ones are not included. 

In [None]:
datBas=datBas[,c("age","educNum","mariStat","occup","origEthn",
                 "sex","hoursWeek","income","LcapitalGain","LcapitalLoss","child")]
summary(datBas)

Extraction of learning, validation and test samples. You can change the initial `seed` value of the random number generator to obtain a different split between learning, validation and testing.

Since the initial sample is relatively large, and in accordance with the expectations of the *AI Act* project, a validation sample is used for hyper-parameter optimization instead of the more accurate but cumbersome cross-validation procedure.

In [None]:
set.seed(11) # initialize the generator
# extraction of samples
test.ratio=.2 # proportion of test sample
npop=nrow(datBas) # number of rows in the data
nvar=ncol(datBas) # number of columns
# size of validation and test samples
ntest=ceiling(npop*test.ratio) 
# indices of the test sample
testi=sample(1:npop,ntest)
# remaining indices of the sample
resti=setdiff(1:npop,testi); nrest=length(resti)
# indices of the validation sample
vali=resti[sample(1:nrest,ntest)]
# indices of the learning sample
appri=setdiff(resti,vali)


In [None]:
# construction of the learning sample
datApp=datBas[appri,]
# construction of the test sample 
daTest=datBas[testi,]
# construction of the validation sample
datVal=datBas[vali,]

Definition of a function that computes the usual error rate from the confusion matrix. With the modalities ordered consistently, the correctly classified observations are the diagonal terms of the confusion matrix, so the errors are the two off-diagonal terms.

In [None]:
tauxErr=function(table){round((table[1,2]+table[2,1])/sum(table)*100,2)}
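
As a quick check of this function, the sketch below applies it to hypothetical predictions and observations (toy vectors, not notebook results). It also illustrates one precaution: the function indexes a 2 x 2 table, so when a classifier predicts a single class it helps to fix the factor levels so that the empty row is kept.

```r
# Same definition as in the cell above, repeated to keep the example self-contained
tauxErr <- function(table) {
  round((table[1, 2] + table[2, 1]) / sum(table) * 100, 2)
}

# Hypothetical observations and predictions
obs  <- factor(c("incB", "incB", "incB", "incH", "incH", "incB", "incH", "incB"),
               levels = c("incB", "incH"))
pred <- factor(c("incB", "incH", "incB", "incH", "incB", "incB", "incH", "incB"),
               levels = c("incB", "incH"))
confMat <- table(pred, obs)
confMat
tauxErr(confMat)   # 2 errors out of 8 observations, in percent

# A classifier predicting a single class: fixed levels keep the table 2 x 2
predOne <- factor(rep("incB", 8), levels = c("incB", "incH"))
tauxErr(table(predOne, obs))
```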

The different results are stored in a table for the purpose of a comparative synthesis graph.

In [None]:
matRes=data.frame(matrix(0,8,4))
rownames(matRes)=c("dataBaseBias","linLogit","tree","linLogit_w_S","testingLogit","randomForest","TreeDiscrPos","RFdiscrPos")
colnames(matRes)=c("Lower","DI","Upper","Accuracy")

## Income prediction
A detailed comparison (cf. the [tutorial](https://github.com/wikistat/Apprentissage/blob/master/Adult-Census/Apprent-Python-AdultCensus.ipynb)) of most of the models and algorithms for predicting the income as a function of the different variables shows slightly better prediction results for the *gradient boosting* algorithm (*extreme* version). Nevertheless, a more restricted choice of models and algorithms is sufficient here to understand the impact on discrimination:
- interpretable (linear) logistic regression,
- a binary decision tree,
- random forests (non-linear integrating interactions) but without interpretation capacity.

A binary tree leads to intermediate forecasting qualities between logistic regression and random forest but is not easily interpretable because of the "optimal" but high number of leaves.

### Forecasting by [logistic regression](http://wikistat.fr/pdf/st-m-app-rlogit.pdf)

In [None]:
# estimation of the complete model
log.lm=glm(income~.,data=datApp,family=binomial)


In [None]:
summary(log.lm)

**Q** Interpret the role of the variables on the income threshold.

In [None]:
# Prediction of the test sample
pred.log=predict(log.lm,newdata=daTest,type="response")
# Confusion matrix for the prediction of
# exceeding the income threshold
confMat=table(pred.log>0.5,daTest$income)
confMat

In [None]:
err=tauxErr(confMat)
matRes[2,4]=(100-err)/100
err

**Note** A logistic regression model with interactions, i.e. quadratic, does not lead to a significantly better prediction but requires a variable selection (*e.g. stepwise* or `both`) that is time consuming. For the sake of brevity, it is not reproduced; see the [more complete notebook](https://github.com/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetectionLong.ipynb).

### Binary tree forecasting

In [None]:
summary(datApp)

**Warning**: uncomment the installation command below when running in the cloud or if this library is simply not installed.

In [None]:
# install.packages("rpart")

In [None]:
library(rpart)
# Estimation of complete model
tree.mod=rpart(income~.,data=datApp,cp=0.0001)

Optimisation of the penalty

In [None]:
xmat=xpred.rpart(tree.mod)
xerr=(xmat-as.integer(datApp[,"income"]))^2
CVerr=apply(xerr,2,sum)

In [None]:
as.numeric(attributes(which.min(CVerr))$names)

In [None]:
tree.opt=rpart(income~.,data=datApp,control=rpart.control(cp=as.numeric(attributes(which.min(CVerr))$names)))
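
As a possible alternative to the computation above, the fitted `rpart` object already stores a cross-validated error for each candidate penalty in its `cptable` component. The sketch below is a minimal illustration on the `kyphosis` data shipped with `rpart`, chosen only to keep the example self-contained; it is not the procedure of this notebook and may select a different penalty.

```r
library(rpart)
# kyphosis ships with rpart; it is used only to keep the example self-contained
set.seed(11)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0.0001)

# cptable stores, for each candidate penalty CP, the cross-validated error xerror
cpTab  <- fit$cptable
cpBest <- cpTab[which.min(cpTab[, "xerror"]), "CP"]

# Prune the large tree at the selected penalty
fitOpt <- prune(fit, cp = cpBest)
cpBest
```

Because the folds of the cross-validation are drawn at random, the selected value depends on the seed.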

In [None]:
library(partykit)
options(repr.plot.width=16, repr.plot.height=16)
plot(as.party(tree.opt), type="simple")

**Q** What about the clarity of interpretation of this model?

**Q** What about the presence of the gender variable?

Estimate of the error on the test sample

In [None]:
pred.tree=predict(tree.opt,newdata=daTest,type="class")
confMat=table(pred.tree,daTest$income)
confMat
err=tauxErr(confMat)
matRes[3,4]=(100-err)/100
err

A simpler model is obtained by increasing the penalty but with the consequence of restricting the performances. 

In [None]:
tree.int=rpart(income~.,data=datApp,control=rpart.control(cp=0.005))
plot(as.party(tree.int), type="simple")

In [None]:
# stored under a separate name so that pred.tree keeps the predictions of tree.opt
pred.treeInt=predict(tree.int,newdata=daTest,type="class")
confMat=table(pred.treeInt,daTest$income)
confMat
tauxErr(confMat)

### Forecasting by [random forests](http://wikistat.fr/pdf/st-m-app-agreg.pdf)

**Q** What are the default options used below?

In [None]:
# install.packages("randomForest")

In [None]:
library(randomForest)
rf.mod=randomForest(income~.,data=datApp)
pred.rf=predict(rf.mod,newdata=daTest,type="response")
confMat=table(pred.rf,daTest$income)
confMat
err=tauxErr(confMat)
matRes[6,4]=(100-err)/100
err

**Q** Compare the results obtained, accuracy and explainability.

The calculations can also be carried out by considering ethnic origin as a sensitive variable (see the [more complete notebook](https://github.com/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetectionLong.ipynb)) but the results are less clear-cut, less "pedagogical".

### Logistic regression model without the gender variable
A very naive approach to build a "fair" learning consists in removing the sensitive variable. The logistic regression model is then estimated without this variable in order to evaluate the impact on the bias later on.

In [None]:
# estimation  of the model 
log_g.lm=glm(income~.,data=datApp[,-6],family=binomial)

In [None]:
# Prediction; column 6 (sex) is dropped, as for the learning sample
pred_g.log=predict(log_g.lm,newdata=daTest[,-6],type="response")
# Confusion matrix  
confMat=table(pred_g.log>0.5,daTest$income)
confMat

In [None]:
err=tauxErr(confMat)
matRes[4,4]=(100-err)/100
err

**Q** What about the prediction accuracy without the gender variable?

## Estimating Disparate Impact
### Definition
Indirect or group discrimination measures are based on a *disparate  impact* (*DI*) criterion that appeared in the US in 1971 (Barocas and Selbst, 2016) to detect discrimination in hiring. This criterion is defined by the ratio of two probabilities: the probability of benefiting from a favorable situation or decision (high income, credit, job, housing...) for a person from the group protected by the law (woman or non-Caucasian origin), over the same probability for a person from the other group (man or Caucasian origin).

*Notations*: 

- $Y$ is the target (response) variable, here income: $Y=1$ high income *vs.* $Y=0$ low income;
- $g(X)=\hat{Y}$ is the prediction of the score or income level: $g(X)=\hat{Y}=1$ predicts a high income, $g(X)=\hat{Y}=0$ a low income;
- $S$ is the sensitive variable that designates the group in principle protected by law from possible discrimination; here it is gender, male ($S=1$) or female ($S=0$). 

The disproportionate effect measures a situation of societal bias already present in the database. 
$$DI=\frac{P(Y=1|S=0)}{P(Y=1|S=1)}.$$
It is estimated from the values of the contingency table crossing the variables $Y$ and $S$ by the ratio:
$$\frac{n_{21}}{(n_{11}+n_{21})}/\frac{n_{22}}{(n_{12}+n_{22})}.$$

Applied to the forecast $g(X)=\hat{Y}$ of the target variable $Y$, it measures the bias of this forecast and thus the discrimination risk operated by the prediction.
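
The estimate can be worked out by hand on a small contingency table. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; rows correspond to $Y$ and columns to $S$, as in the formula above.

```r
# Hypothetical counts: rows = Y (0 then 1), columns = S (0 then 1)
tab <- matrix(c(80, 20,    # S = 0: 80 with Y = 0, 20 with Y = 1
                60, 40),   # S = 1: 60 with Y = 0, 40 with Y = 1
              nrow = 2,
              dimnames = list(Y = c("0", "1"), S = c("0", "1")))
tab

p0 <- tab[2, 1] / (tab[1, 1] + tab[2, 1])   # estimate of P(Y=1 | S=0)
p1 <- tab[2, 2] / (tab[1, 2] + tab[2, 2])   # estimate of P(Y=1 | S=1)
DI <- p0 / p1
c(p0 = p0, p1 = p1, DI = DI)
```

Here 20% of the protected group is in the favorable situation against 40% of the other group, so the estimated ratio is 0.5.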

### `dispImp` function
An R function computes the $DI$ and additionally provides an estimate of a confidence interval ([Besse et al., 2020](https://arxiv.org/pdf/2003.14263.pdf)) obtained by approximating the distribution of the test statistic with the *delta method*. This function has three arguments:   
- the variable $S$ considered sensitive: a factor with two ordered levels, "unfavorable" then "favorable";
- the target variable $Y$ or its prediction $g(X)=\hat{Y}$: also a factor with two levels, unfavorable then favorable;
- the risk level of the confidence interval, 5% by default.

This function returns three values: the lower bound of the confidence interval, the estimate of $DI$, and the upper bound.

Morris S., Lobsenz R. (2000) had already suggested to compute an estimate of the *DI* by confidence interval but by making the hypothesis of Gaussian distributions; this approximation is not justified for the numbers in a contingency table.
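
To give an idea of what a delta-method interval for a ratio of proportions looks like, here is a minimal sketch. It is a textbook first-order approximation written for this illustration; the helper `diDeltaCI` is hypothetical and is not the code of `dispImp.R`, whose exact formula may differ. The simulated data are also an assumption.

```r
# Hypothetical helper, NOT the code of dispImp.R: first-order (delta method)
# interval for the ratio of two proportions estimated on two groups.
diDeltaCI <- function(S, Y, alpha = 0.05) {
  S <- as.factor(S); Y <- as.factor(Y)
  fav <- levels(Y)[2]                        # second level = favorable outcome
  n0  <- sum(S == levels(S)[1])              # size of the protected group
  n1  <- sum(S == levels(S)[2])              # size of the other group
  p0  <- mean(Y[S == levels(S)[1]] == fav)   # favorable rate, protected group
  p1  <- mean(Y[S == levels(S)[2]] == fav)   # favorable rate, other group
  di  <- p0 / p1
  # Var(p0/p1) ~ (p0/p1)^2 * [ (1-p0)/(n0 p0) + (1-p1)/(n1 p1) ]
  v   <- di^2 * ((1 - p0) / (n0 * p0) + (1 - p1) / (n1 * p1))
  z   <- qnorm(1 - alpha / 2)
  c(Lower = di - z * sqrt(v), DI = di, Upper = di + z * sqrt(v))
}

# Simulated data with a built-in gap between the two groups
set.seed(11)
S <- factor(sample(c("Female", "Male"), 2000, replace = TRUE))
Y <- factor(ifelse(runif(2000) < ifelse(S == "Male", 0.30, 0.12), "incH", "incB"))
round(diDeltaCI(S, Y), 3)
```

The interval is symmetric around the point estimate by construction; its width shrinks as the group sizes grow.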

In [None]:
source("../Functions/dispImp.R")

### Disproportionate effect or bias in the learning base 

The function `dispImp` requires that the levels of the factors are in lexicographic order: levels "unfavorable" then "favorable".

Contingency table crossing $Y$ (income) with $S$ (gender).

In [None]:
tableDI=table(datBas$income,datBas$sex)
tableDI

*Pointwise estimate* of $DI=\frac{n_{21}}{(n_{11}+n_{21})}/\frac{n_{22}}{(n_{12}+n_{22})}.$

In [None]:
round((tableDI[2,1]/(tableDI[1,1]+tableDI[2,1]))/(tableDI[2,2]/(tableDI[1,2]+tableDI[2,2])),3)

*Confidence interval estimation* approximated by *delta method*. 

In [None]:
round(dispImp(datBas[,"sex"],datBas[,"income"]),3)

**Q** Comment on the bias measured in this way; compare with the graph (*mosaic plot*) obtained during the exploration.

*[Bootstrap estimate](http://wikistat.fr/pdf/st-m-app-bootstrap.pdf) of the confidence interval*

The estimate of the confidence interval is compared with the behavior of the *DI* on *bootstrap* samples (Efron 1987).

In [None]:
B=1000 
set.seed(11)
n=nrow(datBas)
res=matrix(0,B,1)
for (i in 1:B)
    {
    boot=sample(n,n,replace=T)
    res[i,]=dispImp(datBas[boot,"sex"],datBas[boot,"income"])[2]
    }

In [None]:
options(repr.plot.width=15, repr.plot.height=8)
DI_confInt_delta <- round(dispImp(datBas[,"sex"], datBas[,"income"]), 3)
plot(res,ylim=range(res),pch='.')
lines(res,col=3,pch='.')
abline(h=DI_confInt_delta[c(1, 3)], col=2) 

 

**Q** What about the *DI* estimates on bootstrap samples compared to the bounds of the confidence interval?

A function of the `boot` library provides a bootstrap estimate of the confidence interval.

In [None]:
library(boot)
fc <- function(d, i){ 
    d2 <- d[i,]
    return(statistic=dispImp(d2$sex,d2$income)[2])
}
set.seed(11)
bootDI <- boot(datBas,fc, R=1000)
bootDI

In [None]:
boot.ci(boot.out = bootDI, type = "perc")

**Q** Compare the delta method approximation and the bootstrap estimation of the confidence interval.

**Q** Given the computation time, which one should be preferred?

The interval is finally estimated on the test sample in order to compare with these different predictions.

In [None]:
ic=round(dispImp(daTest[,"sex"],daTest[,"income"]),3)
matRes[1,]=c(ic, 1)
ic

**Q** What about the size of the confidence interval?

### Disparate impact of prediction 

The same ratio or disparate impact calculated on $g(X)$ forecasts of $Y$ rather than on $Y$ explicitly measures the effect of the forecast. It amounts to a test of the equality of favorable forecast rates between the two groups. 

The threshold value of the probability of predicting the level of income is set by default at $0.5$.

#### Logistic regression

In [None]:
Yhat=as.factor(pred.log>0.5)

In [None]:
ic=round(dispImp(daTest[,"sex"],Yhat),3)
matRes[2,1:3]=ic
ic

#### Binary tree

In [None]:
ic=round(dispImp(daTest[,"sex"],pred.tree),3)
matRes[3,1:3]=ic
ic

#### *Random Forest*

In [None]:
ic=round(dispImp(daTest[,"sex"],pred.rf),3)
matRes[6,1:3]=ic
ic

**Q** Compare the three confidence intervals of the *DI* estimate for the original data, the logistic regression forecast and the random forest forecast. Conclusion?

#### Disparate impact of predictions without the sensitive variable gender

In [None]:
Yhat_g=as.factor(pred_g.log>0.5)
ic=round(dispImp(daTest[,"sex"],Yhat_g),3)
matRes[4,1:3]=ic
ic

**Q** What happens to the DI with a forecast that does not use the sensitive variable?

The [long notebook](https://github.com/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetectionLong.ipynb) verifies these results by considering (*Monte Carlo* cross-validation) 20 replications of the separation of the training and test samples on which three algorithms are trained: logistic regression, decision tree, *random forest*, before evaluating the observed DI on the test sample forecast. 

As expected, the accuracy depends strongly on the chosen algorithm. Moreover, we see here that the better the accuracy, the less the bias is reinforced compared to the `DIbase` of the training data. But, for a given algorithm, the *DI* is not correlated to the accuracy on a training sample.

**Q** Partial conclusion on the impact of each algorithm, whose interpretability should also be taken into account, especially with respect to the American law.

**Warning** As Friedler et al. (2019) remind us, results and conclusions can change from one dataset to another. This is already well known for prediction accuracy; it must also be taken into account in bias management. The results presented in this tutorial differ on some points from those of Friedler et al. (2019). The main reason is probably the difference in the strategy adopted for data preprocessing: as a matter of principle, Friedler et al. (2019) analyze the raw data without any elementary statistical exploration and thus without any prior processing. There may also be implementation differences, to be verified, between the R and Python versions of the algorithms.

### Disparate impact *vs. Testing*
#### Survey commissioned by DARES
*Testing* is originally a common method for detecting *direct* discrimination by a human. It has been "adapted" (Riach and Rich 2002) and deployed by [DARES](http://dares.travail-emploi.gouv.fr/dares-etudes-et-statistiques/etudes-et-syntheses/dares-analyses-dares-indicateurs-dares-resultats/testing) of the French Ministry of Labour (cf. [article Le Monde 2020](https://www.lemonde.fr/societe/article/2020/01/08/une-etude-montre-des-discriminations-a-l-embauche-significatives-en-fonction-de-l-origine_6025227_3224.html)) to detect indirect discrimination against a group by survey. It consists in evaluating the variability of a decision when only the modality of the sensitive variable is modified. 

The calculations below reproduce the global results of the last DARES survey.

In [None]:
origine.i=matrix(0,10000,1);reponse.i=matrix(0,10000,1)
origine.i[4536:8910]=1;origine.i[9376:10000]=1
reponse.i[8911:10000]=1
origine=factor(origine.i,labels=c("Maghreb","France"))
reponse=factor(reponse.i,labels=c("Negative","Positive"))
table(reponse, origine)

In [None]:
100*465/5000;100*625/5000; 465/625

The ratios are indeed those of the survey. There are many ways to compare them in order to conclude whether or not the discrimination is considered significant. What does the disproportionate effect assessment look like?

In [None]:
options(repr.plot.width=4, repr.plot.height=4)
mosaicplot(table(origine,reponse),main="",col="lightblue",cex=1.3)

In [None]:
round(dispImp(origine,reponse),3)
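
Among the many ways of comparing the two rates mentioned above, a classical one is a two-sample test of equal proportions. The sketch below uses the counts of the survey reproduced in the cells above (465 and 625 positive answers out of 5000 resumes per group) with the base R function `prop.test`; it complements the *DI* interval and does not replace it.

```r
# Counts from the cells above: positive answers out of 5000 resumes per group
positives <- c(Maghreb = 465, France = 625)
sent      <- c(Maghreb = 5000, France = 5000)

# H0: the positive-answer rate is the same in both groups
testProp <- prop.test(positives, sent)
testProp$estimate   # observed rates in each group
testProp$p.value
```

This test answers whether the two rates differ; the 4/5 rule asks a different question, namely whether their ratio falls below 0.8.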

**Q** Does this *testing* survey show statistically significant discrimination under the U.S. regulations (4/5 rule)?

**Q** What about the accuracy of the *DI* assessment with 10,000 resumes sent? Would it be possible to conclude for a given company?

#### *Testing* of a learning algorithm

What happens if *testing* is applied to an automatic decision driven by a learning algorithm? 

Income predictions are computed for the same people in the test sample, taking into account the initial gender and then the opposite gender. In this case, a woman for whom the income or creditworthiness prediction changes when the gender variable changes from `Female` to `Male` would be entitled to sue for direct discrimination. 

In [None]:
daTest2=daTest
# Gender change
daTest2$sex=as.factor(ifelse(daTest$sex=="Male", "Female", "Male"))
# Prediction of the "new" test sample
pred2.log=predict(log.lm,daTest2,type="response")
Yhat2=as.factor(pred2.log>0.5)

In [None]:
table(Yhat,Yhat2)

In [None]:
# distribution by gender
table(Yhat,Yhat2,daTest$sex)

**Q** Complete: There are $x+y$ people whose income expectation changes when they change gender. And the change is in the expected direction.
- $x$ women go from a low income expectation to a high income expectation
- $y$ men go the opposite way when they become women, so these men were positively discriminated against!
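
The mechanics of this swap can be reproduced on simulated data. In the sketch below, the data frame `d`, the variable `edu` and the coefficients used for the simulation are all hypothetical; the point is only that, in a logistic model without interactions, swapping the gender shifts every predicted probability in the direction given by the sign of the gender coefficient.

```r
# Simulated data (hypothetical): income depends on education and, additively, on sex
set.seed(11)
n   <- 2000
sex <- factor(sample(c("Female", "Male"), n, replace = TRUE))
edu <- rnorm(n)
y   <- factor(ifelse(runif(n) < plogis(-1 + 1.2 * edu + 0.8 * (sex == "Male")),
                     "incH", "incB"))
d   <- data.frame(sex, edu, y)
fit <- glm(y ~ sex + edu, data = d, family = binomial)

# Same individuals with the opposite gender
d2     <- d
d2$sex <- factor(ifelse(d$sex == "Male", "Female", "Male"), levels = levels(d$sex))
p  <- predict(fit, newdata = d,  type = "response")
p2 <- predict(fit, newdata = d2, type = "response")

# Decisions before and after the swap, split by the original gender
table(original = p > 0.5, swapped = p2 > 0.5, sex = d$sex)
```

Only individuals whose predicted probability is close to the 0.5 threshold change decision, which is why the counts of changes remain small compared with the sample size.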

What would be the results of a testing survey that sends the "files" to the algorithm twice, once for each gender? We must therefore consider twice as many people by concatenating the two forecasts. This leads to the contingency table below. 

In [None]:
# concatenation function of two vectors of factor type
c.factor <- function(..., recursive=TRUE) unlist(list(...), recursive=recursive) 
Yhat=factor(Yhat,labels=c("incB","incH")); Yhat2=factor(Yhat2,labels=c("incB","incH"))
mosaicplot(table(c.factor(daTest$sex,daTest2[,"sex"]),c.factor(Yhat,Yhat2)),main="",col="lightblue",cex=.8)

In [None]:
ic=round(dispImp(c.factor(daTest$sex,daTest2[,"sex"]),c.factor(Yhat,Yhat2)),3)
matRes[5,1:3]=ic
ic

**Q** Conclusion: is testing adequate to detect algorithmic discrimination?

**Q** Think about the role of the gender variable in predictions.

#### *Testing* and "unfair" prediction 
A simple way for a company to protect itself against a *testing* operation consists in setting as predicted probability the maximum of the two probabilities obtained by exchanging the modalities of the sensitive variable. In general, choose the most favourable situation for the person whatever the observed gender. The individual discrimination detectable by *testing* is neutralized and the influence on the error rate is almost negligible. 

In [None]:
fairPredictGenre=pmax(pred.log, pred2.log) 
confMat=table(fairPredictGenre>0.5,daTest$income)
confMat;err=tauxErr(confMat)
matRes[5,4]=(100-err)/100
err

In [None]:
round(dispImp(daTest$sex,as.factor(pred.log>0.5)),3)
round(dispImp(daTest$sex,as.factor(fairPredictGenre>0.5)),3)

**Warning** This procedure **intentionally** masks direct discrimination detectable by testing while promoting indirect discrimination; it is clearly **condemnable under the penal code**. Be careful in your future professional practices!

## Explain, mitigate discrimination?
The logistic regression model notably reproduces the social bias and reinforces it by introducing discrimination; this is less clear for the random forest algorithm. Is it possible to explain this behavior or, more precisely, to use the right model or algorithm that avoids it? The [long notebook](https://github.com/wikistat/Fair-ML-4-Ethical-AI/blob/master/AdultCensus/AdultCensus-R-biasDetectionLong.ipynb) compares different algorithms in different situations, including unsuccessfully assigning more weight to women to compensate for their underrepresentation. 

The literature proposes an avalanche of methods for debiasing an algorithmic decision. Three approaches are developed:
- *Pre-processing* by debiasing the training data;
- *Processing* by penalizing the objective function with a fairness constraint, although the optimization is then no longer convex;
- *Post-processing* by de-biasing the decisions.

Friedler et al. (2019) and [AIF360](https://aif360.mybluemix.net/) provide a systematic numerical comparison of some of these approaches on several public datasets, including the one in this tutorial. 

A rudimentary but effective version of post-processing consists of estimating two models or training two algorithms, one for women and one for men and then adjusting the decision threshold to reduce the disproportionate effect while controlling the error rate. This procedure is tested in the cases of a binary tree and random forests. It is a way to introduce a dose of positive discrimination in order to move towards more social equity.

The first part consists in estimating the models separately before introducing positive discrimination in a second part.

**Note** It is probably not necessary to estimate two random forest models by gender. Post-processing the decision thresholds alone should be sufficient. 
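
The idea in the note above can be illustrated on simulated data. The sketch below is not the procedure of this tutorial: it fits a single logistic model on synthetic data and then applies a lower decision threshold to women only. All names (`sim`, `di`, `thr`) and the threshold value $0.3$ are hypothetical choices made for the illustration.

```r
set.seed(1)
n <- 4000
# Synthetic data: s is the sensitive group, x a covariate correlated with it
s <- factor(sample(c("Female", "Male"), n, replace = TRUE))
x <- rnorm(n, mean = ifelse(s == "Male", 0.5, 0))
y <- rbinom(n, 1, plogis(-1.5 + 1.2 * x + 0.6 * (s == "Male")))
sim <- data.frame(s = s, x = x, y = y)

# One single model for everybody
fit <- glm(y ~ x + s, data = sim, family = binomial)
p <- predict(fit, type = "response")

# Disparate impact: P(Yhat = 1 | Female) / P(Yhat = 1 | Male)
di <- function(yhat, s) mean(yhat[s == "Female"]) / mean(yhat[s == "Male"])

# Same threshold for both groups
yhat_common <- p > 0.5
# Lower threshold for women only (0.3 is an arbitrary illustrative value)
thr <- ifelse(sim$s == "Female", 0.3, 0.5)
yhat_group <- p > thr

# Disparate impact and error rate before and after the correction
c(DI_common = di(yhat_common, sim$s), DI_group = di(yhat_group, sim$s))
c(err_common = mean(yhat_common != sim$y), err_group = mean(yhat_group != sim$y))
```

Lowering the threshold for one group can only increase its rate of positive decisions, so the disparate impact moves towards (or beyond) 1; the second line of output shows the price paid in terms of error rate.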

### Separation of the two samples
The samples are separated into two parts.

In [None]:
datAppF=subset(datApp, sex == 'Female') 
datAppM=subset(datApp, sex == 'Male')
datValF=subset(datVal, sex == 'Female') 
datValM=subset(datVal, sex == 'Male')
daTestF=subset(daTest, sex == 'Female')
daTestM=subset(daTest, sex == 'Male')
summary(datAppM)

### Logistic regression
Estimation of the two models.

In [None]:
reg.log=glm(income~.,data=datApp,family=binomial)
# estimation of  both models
reg.logF=glm(income~.,data=datAppF[,-6],family=binomial)
reg.logM=glm(income~.,data=datAppM[,-6],family=binomial)
# comparison of parameters
summary(reg.logF);summary(reg.logM)

**Q** Does comparing the parameters of the two models provide any information?

In [None]:
# predictions  of the models 
yHat=predict(reg.log,newdata=daTest,type="response")
yHatF=predict(reg.logF,newdata=daTestF,type="response")
yHatM=predict(reg.logM,newdata=daTestM,type="response")
# compilation of predictions
yHatFM=c(yHatF,yHatM)

In [None]:
daTestFM=rbind(daTestF,daTestM)
dim(daTestFM)

In [None]:
# errors
table(yHatFM>0.5,daTestFM$income)

In [None]:
table(yHat>0.5,daTest$income)

In [None]:
tauxErr(table(yHatFM>0.5,daTestFM$income))

In [None]:
tauxErr(table(yHat>0.5,daTest$income))

**Q** What happens to the prediction error once the two models are combined with the same decision threshold?

In [None]:
# cumulated bias  vs. bias
round(dispImp(daTestFM[,"sex"],as.factor(yHatFM>0.5)),3); round(dispImp(daTest[,"sex"],as.factor(yHat>0.5)),3)

In [None]:
# Reminder: Bias of the test base
round(dispImp(daTestFM[,"sex"],daTestFM[,"income"]),3)

**Q** What happens to the bias?

### Binary tree 
The objective is to find the least bad compromise between accuracy, interpretability and bias. We limit ourselves to simple trees by using a sub-optimal penalty that restricts the number of leaves. This parameter deserves to be "optimized", but the objective function is unclear since it depends on political and commercial imperatives in the search for a compromise. We retain the choice made previously on the validation sample.

In [None]:
library(rpart)
# Initial model
tree.init=tree.int=rpart(income~.,data=datApp,control=rpart.control(cp=0.005))
# estimation of the two models
tree.F=rpart(income~.,data=datAppF[,-6],control=rpart.control(cp=0.005))
tree.M=rpart(income~.,data=datAppM[,-6],control=rpart.control(cp=0.005))
# tree comparison

In [None]:
options(repr.plot.width=10, repr.plot.height=10)
# as.party() comes from the partykit package (load it if not already done)
library(partykit)
plot(as.party(tree.F), type="simple")

In [None]:
plot(as.party(tree.M), type="simple")

In [None]:
# model prediction
yHatTree=predict(tree.init,newdata=daTest,type="class")
yHatFtree=predict(tree.F,newdata=daTestF,type="class")
yHatMtree=predict(tree.M,newdata=daTestM,type="class")
# compilation of the predictions
yHatFMtree=c(yHatFtree,yHatMtree)

In [None]:
# cumulative error vs. initial tree error
table(yHatFMtree,daTestFM$income); table(yHatTree,daTest$income)

In [None]:
tauxErr(table(yHatFMtree,daTestFM$income))
tauxErr(table(yHatTree,daTest$income))

In [None]:
# Cumulative bias vs. initial model bias vs. base bias
round(dispImp(daTestFM[,"sex"],as.factor(yHatFMtree)),3)
round(dispImp(daTest[,"sex"],as.factor(yHatTree)),3)
round(dispImp(daTestFM[,"sex"],daTestFM[,"income"]),3)

###  *Random Forest*

In [None]:
library(randomForest)
# Initial model
RFinit=randomForest(income~.,data=datApp)
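# Model with observation weights (w is assumed to be defined earlier; the
# argument is spelled `weights`, which recent versions of randomForest accept)
RFinitW=randomForest(income~.,data=datApp,weights=w)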
# estimation of the two models
RF.F=randomForest(income~.,data=datAppF[,-6])
RF.M=randomForest(income~.,data=datAppM[,-6])
# comparison of parameters

In [None]:
# model prediction
yHatrf=predict(RFinit,newdata=daTest,type="response")
yHatrfW=predict(RFinitW,newdata=daTest,type="response")
yHatFrf=predict(RF.F,newdata=daTestF,type="response")
yHatMrf=predict(RF.M,newdata=daTestM,type="response")
# compilation of forecasts
yHatFMrf=c(yHatFrf,yHatMrf)

In [None]:
# cumulative error vs. initial RF error
table(yHatFMrf,daTestFM$income); table(yHatrf,daTest$income)

**Q** Compare errors.

In [None]:
tauxErr(table(yHatFMrf,daTestFM$income))
tauxErr(table(yHatrf,daTest$income))
tauxErr(table(yHatrfW,daTest$income))

In [None]:
# Cumulative bias vs. initial model bias vs. base bias
round(dispImp(daTestFM[,"sex"],as.factor(yHatFMrf)),3)
round(dispImp(daTest[,"sex"],as.factor(yHatrf)),3)
round(dispImp(daTest[,"sex"],as.factor(yHatrfW)),3)
round(dispImp(daTestFM[,"sex"],daTestFM[,"income"]),3)

**Q** Compare biases 

### Discrimination mitigation by *post-processing*.
The *post-processing* procedure below is the simplest one. It introduces a form of positive discrimination by lowering the decision threshold for women while keeping the default threshold of $0.5$ for men, so as not to penalize them further. In this tutorial it is applied to the tree and random forest algorithms only. A graphical optimization procedure, not reproduced here, was used to control on the validation sample the effect of the correction on both the bias and the prediction error. The resulting threshold for women with random forests to predict a high income (above 50k\$) is $0.3$. *Warning*: this threshold may depend on the validation sample and therefore on the initialization of the random number generator. The ROC curves below help illustrate the choice of the threshold.
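
Since the graphical optimization is not reproduced, here is a minimal sketch of what such a search could look like, under stated assumptions: the validation sample is simulated, the score distribution is arbitrary, and the names `grid` and `res` are hypothetical. For each candidate threshold applied to women (men keep $0.5$), the disparate impact and the error rate are computed and then plotted against the threshold.

```r
set.seed(2)
n <- 3000
# Synthetic validation sample: group, true label and a predicted score in [0, 1]
s <- factor(sample(c("Female", "Male"), n, replace = TRUE))
y <- rbinom(n, 1, ifelse(s == "Male", 0.30, 0.12))
score <- plogis(rnorm(n, mean = ifelse(y == 1, 0.8, -1.2)) + 0.4 * (s == "Male"))

# Candidate thresholds for women; men keep the default 0.5
grid <- seq(0.1, 0.5, by = 0.05)
res <- t(sapply(grid, function(thrF) {
  yhat <- score > ifelse(s == "Female", thrF, 0.5)
  c(thrF = thrF,
    DI = mean(yhat[s == "Female"]) / mean(yhat[s == "Male"]),
    err = 100 * mean(yhat != y))
}))
round(res, 3)

# Plot both curves against the threshold to choose a trade-off by eye
matplot(res[, "thrF"], cbind(res[, "DI"], res[, "err"] / 100), type = "b",
        pch = 1:2, col = 1:2, xlab = "Threshold for women",
        ylab = "DI / error rate")
legend("topright", legend = c("DI", "error rate"), pch = 1:2, col = 1:2)
```

The choice of the threshold remains a judgment call: the table only makes the trade-off between bias and error explicit.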

#### ROC curves plot

In [None]:
library(ROCR)
options(repr.plot.width=15, repr.plot.height=15)
ROCtreeM=predict(tree.M,newdata=datValM,type="prob")[,2]
predTreeM=prediction(ROCtreeM,datValM$income)
perfTreeM=performance(predTreeM,"tpr","fpr")
ROCtreeF=predict(tree.F,newdata=datValF,type="prob")[,2]
predTreeF=prediction(ROCtreeF,datValF$income)
perfTreeF=performance(predTreeF,"tpr","fpr")

ROCrfM=predict(RF.M,newdata=datValM,type="prob")[,2]
predRfM=prediction(ROCrfM,datValM$income)
perfRfM=performance(predRfM,"tpr","fpr")
ROCrfF=predict(RF.F,newdata=datValF,type="prob")[,2]
predRfF=prediction(ROCrfF,datValF$income)
perfRfF=performance(predRfF,"tpr","fpr")

par(cex=1.8,lwd=2)
plot(perfTreeM,col=1, print.cutoffs.at=c(0.5))
plot(perfTreeF,col=2, print.cutoffs.at=c(0.1, 0.5),add=TRUE)
plot(perfRfM,col=3, print.cutoffs.at=c(0.5),add=TRUE)
plot(perfRfF,col=4, print.cutoffs.at=c(0.3,0.5),add=TRUE)
legend("bottomright",legend=c("treeMale","treeFemale","RFmale","RFfemale"),col=c(1:4),cex=1,lwd=3)

**Objective**: bring the false positive rates of the two groups, estimated on the validation sample, closer together while reducing the systemic bias without degrading the error rate too much...

**Q** Binary trees: Comment on the "optimality" of the choice of thresholds $0.1$ for women and $0.5$ for men in view of these ROC curves.

**Q** Same thing for random forests: $0.3$ and $0.5$.

#### Binary trees with positive discrimination

In [None]:
# prediction of the models by changing the threshold of the women
yHatFtreeDP=predict(tree.F,newdata=daTestF,type="prob")[,2]>0.1
yHatMtreeDP=predict(tree.M,newdata=daTestM,type="prob")[,2]>0.5
# compilation of predictions
yHatFMtreeDP=c(yHatFtreeDP,yHatMtreeDP)

In [None]:
table(yHatFMtreeDP,daTestFM$income)

In [None]:
err=tauxErr(table(yHatFMtreeDP,daTestFM$income))
matRes[7,4]=(100-err)/100
err

In [None]:
ic=round(dispImp(daTestFM[,"sex"],as.factor(yHatFMtreeDP)),3)
matRes[7,1:3]=ic
ic

#### Random forests with positive discrimination

In [None]:
# model prediction by changing the threshold for women
yHatFrfDP=predict(RF.F,newdata=daTestF,type="prob")[,2]>0.3
yHatMrfDP=predict(RF.M,newdata=daTestM,type="prob")[,2]>0.5
# compilation of predictions
yHatFMrfDP=c(yHatFrfDP,yHatMrfDP)

In [None]:
table(yHatFMrfDP,daTestFM$income)

In [None]:
err=tauxErr(table(yHatFMrfDP,daTestFM$income))
matRes[8,4]=(100-err)/100
err

**Q** What happens to the prediction error?

In [None]:
ic=round(dispImp(daTestFM[,"sex"],as.factor(yHatFMrfDP)),3)
matRes[8,1:3]=ic
ic

**Q** What happens to the bias?

### Graphical summary of results
The previous results (prediction accuracies and confidence intervals of the disparate impact) are collected and displayed on the same graph.

In [None]:
matRes

**Warning**: uncomment the installation command below when running in the cloud or if this library is simply not installed.

In [None]:
# install.packages("Publish")
library(Publish)
options(repr.plot.width=10, repr.plot.height=5)
plotConfidence(x=matRes[,c("DI","Lower","Upper")],
               labels=data.frame("Model"=rownames(matRes),"Accuracy"=matRes[,"Accuracy"]),
               points.pch=15,points.cex=3,points.col=rainbow(6),
               values=FALSE,xlim=c(0.1,1),lwd=4,cex=1.5,
               xlab="Disparate Impact",xlab.cex=1,xratio=0.3,y.title.offset=1)

**Q** Compare the different biases, taking into account the intersection of the confidence intervals. Which algorithms discriminate significantly? Is it effective to remove the sensitive variable from the model? What about *testing*? How does positive discrimination or post-processing affect the decisions of a binary tree or random forests? Is it effective? What is the least bad compromise between accuracy, interpretability and bias?

**Warning**: mitigating the societal bias for more "fairness" also affects the other biases, an impact that is important to consider in the next section.

## Other indicators of bias / discrimination
### Bias on conditional prediction errors or accuracies
Disparate impact is only one source of bias or discrimination among others. A second, often mentioned, concerns the prediction errors or accuracies conditional on the levels of the sensitive variable; this is "overall error equality" or, equivalently, "overall accuracy equality".
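
As a minimal sketch of this criterion, the hypothetical helper `errByGroup` below computes the error rate within each level of the sensitive variable and their ratio on toy data. It is not the `overErrEqual` function sourced later in this section (whose ratio may be oriented differently and which also returns a confidence interval), and it assumes that the target and the prediction share the same coding.

```r
# Hypothetical helper: error rate (in %) within each level of s, and their ratio
errByGroup <- function(s, y, yhat) {
  err <- tapply(as.character(yhat) != as.character(y), s, mean) * 100
  c(err, ratio = unname(err[1] / err[2]))
}

# Toy data: 6 women then 6 men
s    <- factor(rep(c("Female", "Male"), each = 6))
y    <- factor(c(0,0,0,0,1,1, 0,0,1,1,1,0))
yhat <- factor(c(0,0,0,0,0,1, 0,1,1,1,0,0))
errByGroup(s, y, yhat)
```

In this toy example one woman out of six and two men out of six are misclassified, so the ratio of the error rates is 0.5.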

#### Linear logistic regression
*Overall error equality*

In [None]:
table(pred.log>0.5,daTest$income,daTest$sex)

In [None]:
apply(table(pred.log>0.5,daTest$income,daTest$sex),3,tauxErr)


**Q** Which gender appears to be disadvantaged by this criterion?

In [None]:
source('../Functions/overErrEqual.R')

In [None]:
round(overErrEqual(daTest$sex,daTest$income,as.factor(pred.log>0.5)),2)

**Q** Same question.

#### Binary tree

In [None]:
table(yHatTree,daTest$income,daTest$sex)

In [None]:
apply(table(yHatTree,daTest$income,daTest$sex),3,tauxErr)

In [None]:
round(overErrEqual(daTest$sex,daTest$income,yHatTree),2)

#### Random forest

In [None]:
apply(table(yHatrf,daTest$income,daTest$sex),3,tauxErr)

In [None]:
round(overErrEqual(daTest$sex,daTest$income,yHatrf),2)

**Q** Which gender appears to be disadvantaged by this criterion?

#### Binary tree with positive discrimination

In [None]:
tauxErr(table(yHatFtreeDP,daTestF$income)); tauxErr(table(yHatMtreeDP,daTestM$income))

In [None]:
round(overErrEqual(daTestFM[,"sex"],daTestFM$income,as.factor(yHatFMtreeDP)),2)

#### *Random forest* with  positive discrimination

In [None]:
tauxErr(table(yHatFrfDP,daTestF$income)); tauxErr(table(yHatMrfDP,daTestM$income))

In [None]:
round(overErrEqual(daTestFM[,"sex"],daTestFM$income,as.factor(yHatFMrfDP)),2)

**Q** Is the evolution of this criterion logical given the correction adopted on the decision?

**Q** Is this correction socially acceptable?

### Asymmetry of the confusion matrix: *equalized odds* 
Another source of discrimination is considered. It was highlighted by the site [ProPublica](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) concerning the COMPAS recidivism score of the company *equivant*, used in some American courts as a form of "predictive justice". The bias concerns an inversion of the asymmetry of the confusion matrix according to the sensitive variable. A large number of criteria have been proposed to evaluate this asymmetry; [Verma and Rubin (2018)](http://fairware.cs.umass.edu/papers/Verma.pdf) offer a review. Their definitions are based on the different frequencies of the contingency table, computed by the function below.

In [None]:
contRatio <- function(T){ 
    # Ratios computed from the confusion matrix
    TP=T[2,2] # true positive
    FP=T[2,1] # false positive
    FN=T[1,2] # false negative
    TN=T[1,1]  # true negative
    PPV=TP/(TP+FP) # P(Y=1|g(x)=1) positive predictive value
    FDR=FP/(TP+FP) # P(Y=0|g(x)=1) false discovery rate 
    FOR=FN/(TN+FN) # P(Y=1|g(x)=0) false omission rate
    NPV=TN/(TN+FN) # P(Y=0|g(x)=0) negative predictive value
    TPR=TP/(TP+FN) # P(g(x)=1|Y=1) true positive rate
    FPR=FP/(FP+TN) # P(g(x)=1|Y=0) false positive rate
    FNR=FN/(TP+FN) # P(g(x)=0|Y=1) false negative rate
    TNR=TN/(FP+TN) # P(g(x)=0|Y=0) true negative rate
    return(list("PPV"=PPV,"FDR"=FDR,"FOR"=FOR,"NPV"=NPV,"TPR"=TPR,"FPR"=FPR,"FNR"=FNR,"TNR"=TNR))
}

In [None]:
contRatio(table(pred.log>0.5,daTest$income))

The fairness criteria listed below can be defined from the previous frequencies, conditionally on the sensitive variable. The number of possible combinations is large but can be reduced by noting that *PPV*=1-*FDR*, *FOR*=1-*NPV*, *FPR*=1-*TNR*, *FNR*=1-*TPR*... According to the authors, there is fairness of treatment if:
- *Predictive parity*: the two groups have the same *PPV*s and consequently the same *FDR*s;
- *False positive error rate balance* or *predictive equality*: same *FPR*s and consequently the same *TNR*s;
- *False negative error rate balance* or *equal opportunity*: same *FNR*s and consequently the same *TPR*s;
- *Conditional procedure accuracy equality* or *disparate mistreatment* or *equalized odds* combines the two above: same *TPR*s **AND** same *FPR*s;
- *Overall accuracy equality*: same overall accuracy *(TP+TN)/n* in both groups;
- *Conditional use accuracy equality*: same *PPV*s **AND** same *NPV*s;
- *Treatment equality*: the ratios *FN/FP* are the same for both groups.

Many other criteria have been proposed (cf. Verma and Rubin, 2018); they are not developed here. Conditional *TPR* and *FPR* calculations are preferred below, but this is only one choice among others. Friedler et al. (2019) show that these criteria are highly correlated, which justifies restricting a first "reading" to comparisons of *TPR* and *FPR* only.
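
For reference, with $S$ the sensitive variable, $Y$ the target and $\hat{Y}$ the prediction, the two conditional rates and the *equalized odds* criterion can be written as follows (the group labels $F$ and $M$ are notation introduced here for the illustration):

$$
\mathrm{TPR}_s = P(\hat{Y}=1 \mid Y=1, S=s), \qquad \mathrm{FPR}_s = P(\hat{Y}=1 \mid Y=0, S=s), \qquad s \in \{F, M\}
$$

$$
\text{equalized odds:} \quad \mathrm{TPR}_F = \mathrm{TPR}_M \quad \text{and} \quad \mathrm{FPR}_F = \mathrm{FPR}_M .
$$

A ratio of the two conditional rates equal to 1 therefore corresponds to equality for that rate.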

#### Linear logistic regression
The confusion matrix is constructed for each gender to compare the different fairness indicators.

In [None]:
fairness=data.frame("Female"=as.matrix(contRatio(table(pred.log>0.5,daTest$income,daTest$sex)[,,1])),
                    "Male"=as.matrix(contRatio(table(pred.log>0.5,daTest$income,daTest$sex)[,,2])))
fairness

It would be tedious to construct all the comparisons, especially since many of these indicators are redundant. Only the ***Equalized Odds*** are estimated with confidence intervals, using the function `oddsEqual`, which takes four parameters:
- S: protected group variable
- Y: target variable
- P: prediction $\hat{Y}$.
- alpha=0.05, default value.

It provides confidence interval estimates of the ratios of the conditional *FPR* and *TPR*, and thus makes it possible to test whether these rates are equal across the levels of the sensitive variable.
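
The source of `oddsEqual` is not shown here. As an illustration only, one standard way to build a confidence interval for a ratio of two proportions is a Wald interval on the log scale (delta method). The helper `ratioCI` and the counts below are hypothetical and are not claimed to reproduce the method used by `oddsEqual`.

```r
# Hypothetical helper: ratio of two proportions x1/n1 and x2/n2 with a
# Wald confidence interval built on the log scale (delta method)
ratioCI <- function(x1, n1, x2, n2, alpha = 0.05) {
  p1 <- x1 / n1; p2 <- x2 / n2
  se <- sqrt((1 - p1) / x1 + (1 - p2) / x2)   # std. error of log(p1/p2)
  z <- qnorm(1 - alpha / 2)
  r <- p1 / p2
  c(Ratio = r, Lower = r * exp(-z * se), Upper = r * exp(z * se))
}

# Toy counts: true positives among actual positives, by group
# women: 60 detected out of 120; men: 450 detected out of 750
round(ratioCI(60, 120, 450, 750), 3)
```

If the interval does not contain 1, the equality of the two conditional rates is rejected at level `alpha`.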

In [None]:
source('../Functions/oddsEqual.R')

In [None]:
round(oddsEqual(daTest$sex,daTest$income,as.factor(pred.log>0.5)),2)

**Q** Which gender seems to be favored this time in terms of this criterion?

#### Binary tree
Display of confusion matrices by gender:

In [None]:
table(yHatTree,daTest$income,daTest$sex)

Calculation of the different criteria

In [None]:
fairnessTree=data.frame("Female"=as.matrix(contRatio(table(yHatTree,daTest$income,daTest$sex)[,,1])),
                    "Male"=as.matrix(contRatio(table(yHatTree,daTest$income,daTest$sex)[,,2])))
fairnessTree

Compare, for example, the false positive and false negative rates by gender or the estimates of the ratios below:

In [None]:
round(oddsEqual(daTest$sex,daTest$income,yHatTree),2)

**Q** Same question with a binary tree. If this were a credit score assessment, how would the risks incurred by the bank differ by gender, and what breach of fairness would result?

#### *Random forest* 

In [None]:
fairnessRF=data.frame("Female"=as.matrix(contRatio(table(yHatrf,daTest$income,daTest$sex)[,,1])),
                    "Male"=as.matrix(contRatio(table(yHatrf,daTest$income,daTest$sex)[,,2])))
fairnessRF

In [None]:
round(oddsEqual(daTest$sex,daTest$income,yHatrf),2)

**Q** Same question with a random forest.

#### Binary tree with positive discrimination

In [None]:
fairnessTreeDP=data.frame("Female"=as.matrix(contRatio(table(yHatFtreeDP,daTestF$income))),
                    "Male"=as.matrix(contRatio(table(yHatMtreeDP,daTestM$income))))
fairnessTreeDP

In [None]:
round(oddsEqual(daTestFM[,"sex"],daTestFM$income,as.factor(yHatFMtreeDP)),2)

#### *Random forest* with positive discrimination

In [None]:
fairnessRFDP=data.frame("Female"=as.matrix(contRatio(table(yHatFrfDP,daTestF$income))),
                    "Male"=as.matrix(contRatio(table(yHatMrfDP,daTestM$income))))
fairnessRFDP

In [None]:
round(oddsEqual(daTestFM[,"sex"],daTestFM$income,as.factor(yHatFMrfDP)),2)

**Q** Does the correction go in the expected direction? 

**Q** In conclusion, is the post-processing of the threshold, for this example, in the right direction for all the bias criteria?

**Q** Comment on the "recommendations" of [Goglin (2021)](https://theconversation.com/discrimination-et-ia-comment-limiter-les-risques-en-matiere-de-credit-bancaire-167008).

## Conclusion

Conclude on:
- the choice between a linear model, a tree model, or an aggregation of trees, weighing prediction quality *vs.* interpretability;
- the role of the sensitive variable in a model, and thus the effect of prohibiting the use of a sensitive variable such as ethnic origin;
- the effectiveness of testing;
- which of the three types of bias considered would reveal a breach of equity according to gender with regard to the risks incurred by a bank;
- the impact of the rudimentary post-processing implemented on the accuracy and on the other biases.

Finally, which algorithm should be chosen, and how can this choice be justified from an economic point of view for a bank but also with regard to the social image of fairness of the procedure?

**Remarks**.
- It is probably not necessary to estimate two models according to gender. Post-processing of the decision thresholds alone should suffice.
- In the USA, the calculation of the *adverse* or *disparate impact* is taken into account in the hiring process as required by labor law, which involves maintaining "ethnic statistics". In France, only testing operations are deployed.
- In the USA, using a non-linear algorithm without control is risky because too large a bias ($DI<0.8$) that cannot be explained, and therefore justified, is open to legal sanction. This is why some American "hiring tech" companies offer bias mitigation to avoid lawsuits (Raghavan et al. 2019).
- A *data scientist* currently has a lot of latitude to do whatever they want, without control: from unfair, punishable behavior to positive discrimination intended to introduce more equity in society!
- We can hope that, following the implementation of the GDPR, the adoption of the *AI Act* will impact these practices.
- A big job lies ahead for a *data scientist* in charge of a processing operation, who will have to keep complete and detailed documentation of all these procedures, from data collection to the monitoring of an artificial intelligence system in operation. 

**It is highly recommended to anticipate this upcoming regulation in order to justify the choices made, i.e. the least bad compromise involving data confidentiality, prediction accuracy, model interpretability and bias or discrimination risks.**



## References
Barocas S., Selbst A. (2016). [Big Data's Disparate Impact](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899), *California Law Review*, 671.

Besse P. (2020). [Détecter, évaluer les risques des impacts discriminatoires des algorithmes d'IA](https://hal.archives-ouvertes.fr/hal-02616963), contribution to the Défenseur des Droits and CNIL seminar, 28 May 2020, submitted article.

Besse P., del Barrio E., Gordaliza P., Loubes J.-M., Risser L. (2021). [A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set](https://doi.org/10.1080/00031305.2021.1952897), *The American Statistician*, DOI: 10.1080/00031305.2021.1952897. [Open access version](https://arxiv.org/pdf/2003.14263.pdf).

Efron B. (1987). [Better Bootstrap Confidence Intervals](https://www.jstor.org/stable/2289144?seq=1), *Journal of the American Statistical Association*, Vol. 82, No. 397 (Mar., 1987), pp. 171-185. 

Friedler S., Scheidegger C., Venkatasubramanian S., Choudhary S., Hamilton E., Roth D. (2019). [A comparative study of fairness-enhancing interventions in machine learning](https://dl.acm.org/doi/10.1145/3287560.3287589), *Proceedings of the Conference on Fairness, Accountability, and Transparency*.

Morris S., Lobsenz R. (2000). [Significance Tests and Confidence Intervals for the Adverse Impact Ratio](https://doi.org/10.1111/j.1744-6570.2000.tb00195.x), *Personnel Psychology*, 53: 89-111.

Riach P., Rich J. (2002). [Field Experiments of Discrimination In The Market Place](https://doi.org/10.1111/1468-0297.00080), *The Economic Journal*, Vol. 112, 480-518.

Raghavan M., Barocas S., Kleinberg J., Levy K. (2019). [Mitigating bias in Algorithmic Hiring: Evaluating Claims and Practices](https://arxiv.org/abs/1906.09208), arXiv:1906.09208.

Verma S., Rubin J. (2018). [Fairness Definitions Explained](http://fairware.cs.umass.edu/papers/Verma.pdf),  *ACM/IEEE International Workshop on Software Fairness*.

Zliobaitė I. (2015). [A survey on measuring indirect discrimination in machine learning](https://arxiv.org/pdf/1511.00148.pdf), arXiv preprint.
