# Install and load required packages

In [None]:
Pkg.add("Dataframes")
Pkg.add("Optim") # For L-BFGS <https://github.com/JuliaOpt/Optim.jl#basic-api-introduction>

In [1]:
using DataArrays, DataFrames

TODO: Load the original Adult dataset CSV and preprocess it right here
- one-hot encoding

# Load the "Adult" dataset
The Adult dataset is from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).
I have preprocessed the data. In categorica data missing data is handled as just another category.

In [2]:
#df_data = readtable("data/adult.data", nastrings = ["", "NA", "?"]);
#df_test = readtable("data/adult.test", nastrings = ["", "NA", "?"]);
df_data = readtable("data/adult.processed.data");
df_test = readtable("data/adult.processed.test");
names(df_data)

15-element Array{Symbol,1}:
 :age           
 :workclass     
 :fnlwgt        
 :education     
 :education_num 
 :marital_status
 :occupation    
 :relationship  
 :race          
 :sex           
 :capital_gain  
 :capital_loss  
 :hours_per_week
 :native_country
 :classification

In [3]:
df_test[ 1:3, :age ] # Selecting subset of rows, by column name

3-element DataArrays.DataArray{Int64,1}:
 25
 38
 28

# Set parameters
$K$ will be set to the number of prototypes

In [4]:
# Size or original matrix -1 for target column and -1 for sensitive column
sensitive_column_name = :sex # Sensitive column is "gender" as in the paper
classification_column_name = :classification
K = size(df_data,2) - 2

13

# Auxiliary helper functions

In [89]:
# A nice Unicode named summation function
‚àë(from::Integer, to::Integer, inner::Function) = sum(inner, colon(from,to))
# Test:
#f(x, k) = x*k
#‚àë(1, 3, (k) -> f(1, k))

# Partition a matrix two, according to given indices or indicator vector.
# Your matrices need to be column-major, as this is the Julia memory layout.
function partition{T<:Integer}(x::Vector, indices::Vector{T})
    return x[ indices ], x[ setdiff(1:length(X), indices) ]
end
function partition{T<:Integer}(X::Matrix, indices::Vector{T})
    return X[ :, indices ], X[ :, setdiff(1:size(X,2), indices) ]
end
function partition{T<:Integer,U<:Any}(X::SharedArray{U,2}, indices::Vector{T})
    return X[ :, indices ], X[ :, setdiff(1:size(X,2), indices) ]
end
function partition{T<:Bool}(x::Vector, indicator::Vector{T})
    return partition(x, find(indicator))
end
function partition{T<:Bool}(X::Matrix, indicator::Vector{T})
    return partition(X, find(indicator))
end
function partition{T<:Bool,U<:Any}(X::SharedArray{U,2}, indicator::Vector{T})
    return partition(X, find(indicator))
end
# Test:
# @which partition([:first, :second, :third, :fourth], [true, false, true, false])
# @which partition([:first, :second, :third, :fourth], [1,3])

partition (generic function with 8 methods)

# Code for the model

## Definitions
We will differ a bit from the definitions in the paper. Notably in the definition of the random variables $Z$. The definitions we use are: TODO MAKE THESE MATCH THE PAPER, THEN TELL WHAT IS CHANGED
- $\mathbf{X}$ denotes the entire data set, a $(N \times D)$ matrix. The rows of the matrix are the feature vectors $\mathbf{x}_n$ representing attributes of an individual. $\mathbf{X}$ contains neither the classification information (target) column, nor the sensitive column.
- $S$ is a binary random variable representing whether or not a given individual is a member of the "protected set". For the user of the algorithm, this is a decision that is done before running the algorithm.
- $\mathbf{X}_{train}$ denotes the training set.
- $\mathbf{X}_{test}$ denotes the test set.
- $\mathbf{X}^+$ denotes the subset of individuals that are members of the "protected group" i.e. individuals for whom $S=1$. Similarly $\mathbf{X}^-$ denotes the subset of individuals for whom $S=0$.
- Define $\mathbf{X}_{train}^+$, $\mathbf{X}_{train}^-$, $\mathbf{X}_{test}^+$ and $\mathbf{X}_{test}^-$ similarly as above.
- $d$ is a distance measure on $\mathbf{X}$ (e.g. euclidean distance).
- $Z$ is a random integer from the set $\left\{1,\dots,K\right\}$.
- $Y$ is a binary random variable representing the classification decision (we consider binary classification only).

Let $Z$ be a random integer from the set $\left\{1,\dots,K\right\}$. Now we can denote the probability that a datapoint $\mathbf{x}$ maps to a particular prototype $k$ with $\mathbb{P}(Z=k \mid \mathbf{x})$ i.e. given the datapoint $\mathbf{x}$, the probability that $Z$, the index of the prototype, is $k$.

## Definitions in code

From the definitions above, we map some to code and also define additional stuff.

Julia is Column-Majored, so our matrices will be altered accordingly.

- $\mathbf{X}$ is just `ùêó`, but $(D \times N)$ instead of $(N \times D)$.
- $\mathbf{X}_{train}$ is `ùêótrain`, $\mathbf{X}_{test}$ is `ùêótest`
- Grouped versions are `ùêó‚Å∫train`, `ùêó‚Åªtrain`, `ùêó‚Å∫test`, and `ùêó‚Åªtest` respectively
- $S$ maps to different vectors `S_<someset>` containing the sensitive column for `<someset>`, e.g. `S_ùêó`
- $d$ is defined as a lambda function `d`
- $Z$ is replaced by $\mathbf{Z}$, a matrix of probability vectors $\mathbf{z}$
- $Y$ TODO

Additionally, denote the tuple of prototypes $\mathbf{V} = \left(\mathbf{v}_1,...,\mathbf{v}_K\right)$. Since a single prototype $\mathbf{v}_k$ is a vector of length $D$, $\mathbf{V}$ can be expressed as a ($D \times K$) matrix. This is our optimization variable.

In [6]:
# Symbols: ùêó ‚Å∫ ‚Åª ‚àë ùêï ùê±

# Sensitive indices for training and test data
S_ùêótrain = convert(Array{Bool}, df_data[ sensitive_column_name ])
S_ùêótest = convert(Array{Bool}, df_test[ sensitive_column_name ])

# Classification vectors for training and test data
ùê≤train = convert(Array, df_data[ classification_column_name ])
ùê≤test = convert(Array, df_test[ classification_column_name ])

# Drop sensitive and classification columns
idxs_left = setdiff(names(df_data), [sensitive_column_name, classification_column_name])
ùêótrain = transpose(convert(Matrix, df_data[idxs_left]))
ùêótest = transpose(convert(Matrix, df_test[idxs_left]))

# Standardize the features with mean 0 variance 1
# TODO: tell why, exponentiation goes quickly out of hand, e^-800 is already NaN on Float64
train_mean = mean(ùêótrain,2)[:]
train_var = var(ùêótrain,2)[:]
ùêótrain = ùêótrain .- train_mean # substraction of vector
ùêótrain = ùêótrain ./ train_var # division by vector
ùêótest = ùêótest .- train_mean # same mean as in training, on purpose
ùêótest = ùêótest ./ train_var # same variance as in training, on purpose

#ùêótest .- train_mean

# Reconstruct full dataset
ùêó = hcat(ùêótrain, ùêótest)
S_ùêó = vcat(S_ùêótrain, S_ùêótest)
ùê≤ = vcat(ùê≤train, ùê≤test)

# Dimensions
D = size(ùêó, 1)
N = size(ùêó, 2)
Ntrain = size(ùêótrain, 2)
Ntest = size(ùêótest, 2)

### Distance function
# Lambda
d = (ùêö::Vector, ùêõ::Vector) -> vecnorm(ùêö - ùêõ) # Euclidean distance
# Non-lambda is slightly faster for calculations, but has to be defined
# for all processes with @everywhere
@everywhere de(ùêö::Vector{Float64}, ùêõ::Vector{Float64}) = vecnorm(ùêö - ùêõ)

### Optimization variables
# Main optimization variable, matrix holding the prototype vectors.
# Initialized to random Float64 matrix, normal distribution 0 mean 1 variance
ùêï = randn(D, K)
# Weights or "prototype label predictions" (probabilities)
ùê∞ = rand(K) # floats in [0,1)

### Hyperparameters
A = Dict(:z=> 1000, :x=> 0.0001, :y=> 0.1)

print("Done.")

Done.

In [7]:
maximum(ùêótrain), minimum(ùêótrain), maximum(ùêótest), minimum(ùêótest), maximum(ùêï), minimum(ùêï)

(1.9488022835672973,-5.088141178187719,1.9488022835672973,-5.088141178187719,2.745576743598197,-2.860178737087543)

Divide training and test sets to groups according to whether the individuals are "protected" or not.

In [8]:
(ùêó‚Å∫train, ùêó‚Åªtrain) = partition(ùêótrain, S_ùêótrain)
(ùêó‚Å∫test, ùêó‚Åªtest) = partition(ùêótest, S_ùêótest)
"Training:",size(ùêó‚Å∫train), size(ùêó‚Åªtrain), "Test:", size(ùêó‚Å∫test), size(ùêó‚Åªtest)

("Training:",(13,21790),(13,10771),"Test:",(13,10860),(13,5421))

## Mapping $\mathbf{X} \rightarrow \mathbf{Z}$
Now we can define a mapping from the original dataset $\mathbf{X}$ to probabilities via the *softmax function*. [Wikipedia](https://en.wikipedia.org/wiki/Softmax_function):
> Softmax function "squashes" a $K$-dimensional vector $\mathbf{z}$ of arbitrary real values to a $K$-dimensional vector $\sigma(\mathbf{z})$ of real values in the range $[0, 1]$ that add up to 1.

Most notably `softmax` returns a probability vector. We will define a modified version that maps a $D$-dimensional vector $\mathbf{x}$ to a $K$-dimensional vector $\sigma(\mathbf{x})$, i.e. the mapping won't necessarily preserve the dimensionality of $\mathbf{x}$.

Also from [Wikipedia](https://en.wikipedia.org/wiki/Multinomial_logistic_regression):
> $$\operatorname{softmax}(k,x_1,\ldots,x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}$$
> is referred to as the [*softmax function*](https://en.wikipedia.org/wiki/Softmax_function).  The reason is that the effect of exponentiating the values $x_1,\ldots,x_n$ is to exaggerate the differences between them.  As a result, $\operatorname{softmax}(k,x_1,\ldots,x_n)$ will return a value close to 0 whenever $x_k$ is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value.  Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the [indicator function](https://en.wikipedia.org/wiki/Indicator_function).

So we define, as in the paper equation (2), $$\mathbb{P}(Z=k \mid \mathbf{x}) = \frac{e^{-d(\mathbf{x}, \mathbf{v}_k)}}{\sum_{j=1}^K e^{-d(\mathbf{x}, \mathbf{v}_j)}}$$
where
- $\mathbb{P}(Z=k \mid \mathbf{x})$ is described in [definitions](#Definitions)
- $\mathbf{x}$ is the datapoint
- $\mathbf{v}_k$ is a vector associated with the $k$th prototype
- $d$ is a distance measure between $\mathbf{x}$ and $\mathbf{v}_k$ (e.g. the euclidean distance)

This means that since we have replaced $x_k$ in the softmax with the negative distance between $\mathbf{x}$ and prototype $\mathbf{v}_k$, the softmax returns a value close to 0 whenever the distance from $\mathbf{x}$ to the prototype $\mathbf{v}_k$ is significantly higher than $\min_{j\in\{1,\dots,K\}}\, d(\mathbf{x}, \mathbf{v}_i)$, and close to 1 when applied to the minimum value.

"Mapping from X to Z" in the paper means mapping the vector $\mathbf{x}$ to a probability vector $\mathbf{z}$ of length $K$ via the softmax function. These probability vectors are then used directly for training the classifier.

**In code** this means that $\mathbb{P}(Z=k \mid \mathbf{x})$ is represented by a function taking
- the data point $\mathbf{x}$ which is a `Vector` of length $D$
- $(D \times K)$ `Matrix` of prototypes $\mathbf{V}$ containing all $K$ prototypes $\mathbf{v}_k$ i.e. each prototype is a `Vector` (remember, Column-Major order on matrices)
- a distance measure on $\mathbf{X}$

and returning
- a `Vector` $\mathbf{z}$ of length $K$ representing a [probability vector](https://en.wikipedia.org/wiki/Probability_vector), where each value $z_k$ of the probability vector $\mathbf{z}$ tells how probable it is that $\mathbf{x}$ maps to $\mathbf{v}_k$. Since $\mathbf{z}$ is a probability vector, $\sum_{i=k}^K z_k = 1$.

We will name this function `softmax` and define it as follows:

In [9]:
# TODO: explain, Give in ùêó, get ùêô; give in ùê±, get ùê≥
# TODO: clean up duplicate code

### FOR VECTORS

# This is same as Eq (2) in paper.
function softmax_dist{T<:Number}(ùê±::Vector{T}, ùêï::Matrix{T}, distanceMeasure::Function)
    K = size(ùêï, 2)
    res = Vector{Float64}(K)
    denominator = Float64(0.0)
    # Use one loop to calculate both numerator and denominator
    for k in 1:K
        res[k] = exp(- distanceMeasure(ùê±, ùêï[:,k]) )
        denominator += res[k]
    end
    denom = inv(denominator)
    res .* denom
end

function softmax_euclidean{T<:Number}(ùê±::Vector{T}, ùêï::Matrix{T})
    K = size(ùêï, 2)
    res = Vector{Float64}(K)
    denominator = Float64(0.0)
    # Use one loop to calculate both numerator and denominator
    for k in 1:K
        res[k] = exp(- vecnorm(ùê± - ùêï[:,k]) )
        denominator += res[k]
    end
    denom = inv(denominator)
    res .* denom
end

### FOR MATRICES

function softmax_dist{T<:Number}(ùêó::Matrix{T}, ùêï::Matrix{T}, distanceMeasure::Function)
    N = size(ùêó, 2)
    K = size(ùêï, 2)
    # Preallocate result matrix, no need to zero it
    res = Matrix{Float64}(K, N)
    for n in 1:N
        res[:,n] = softmax_dist(ùêó[:,n], ùêï, distanceMeasure)
    end
    res
end

# Parallel version
function softmax_dist_par{T<:Number}(ùêó::Matrix{T}, ùêï::Matrix{T}, distanceMeasure::Function)
    nprocs()==CPU_CORES || addprocs(CPU_CORES-1)    
    N = size(ùêó, 2)
    K = size(ùêï, 2)
    # Preallocate result matrix, no need to zero it
    res = SharedArray(Float64, (K, N))
    @sync @parallel for n in 1:N
        for k in 1:K
            res[k,n] = exp(- distanceMeasure(ùêó[:,n], ùêï[:,k]) )
        end
        res[:,n] = res[:,n] .* inv(sum(res[:,n]))
    end
    res
end

function softmax_euclidean{T<:Number}(ùêó::Matrix{T}, ùêï::Matrix{T})
    N = size(ùêó, 2)
    K = size(ùêï, 2)
    # Preallocate result matrix, no need to zero it
    res = Matrix{Float64}(K, N)
    for n in 1:N
        denominator = Float64(0.0)
        # Use one loop to calculate both numerator and denominator
        for k in 1:K
            res[k,n] = exp(- vecnorm(ùêó[:,n] - ùêï[:,k]) )
            denominator += res[k,n]
        end
        res[:,n] = res[:,n] .* inv(denominator)
    end
    res
end

# Parallel version
function softmax_euclidean_par{T<:Number}(ùêó::Matrix{T}, ùêï::Matrix{T})
    nprocs()==CPU_CORES || addprocs(CPU_CORES-1)
    N = size(ùêó, 2)
    K = size(ùêï, 2)
    # Preallocate result matrix, no need to zero it
    res = SharedArray(Float64, (K, N))
    @sync @parallel for n in 1:N
        for k in 1:K
            res[k,n] = exp(- vecnorm(ùêó[:,n] - ùêï[:,k]) )
        end
        res[:,n] = res[:,n] .* inv(sum(res[:,n]))
    end
    res
end

softmax_euclidean_par (generic function with 1 method)

In [10]:
# For development
ùêôpre = softmax_euclidean_par(ùêótrain, ùêï)
ùêôpre‚Å∫ = softmax_euclidean_par(ùêó‚Å∫train, ùêï)
ùêôpre‚Åª = softmax_euclidean_par(ùêó‚Åªtrain, ùêï);

This approach is akin to using a funky [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) to "predict the prototype" (category) where a data point $\mathbf{x}$ maps to.
Wikipedia:
> These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items which cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logit regression is a particular solution to the classification problem that assumes that a linear combination of the observed features and some problem-specific parameters can be used to determine the probability of each particular outcome of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data.


## Parts of the optimization objective

### $L_z$ &mdash; statistical parity
In code, we will denote $L_z$ with function `Lz`.

In [88]:
### "NAIVE" VERSION FOR UNDERSTANDING THE IMPLEMENTATION

function LzNaive{T<:Number}(ùêó‚Å∫::Matrix{T}, ùêó‚Åª::Matrix{T}, ùêï::Matrix{T}, dist::Function)
    # Operate on matrices and take mean from sample dimension N
    meanp = mean( softmax_dist_par(ùêó‚Å∫, ùêï, dist), 2 ) # Eq (6)
    meann = mean( softmax_dist_par(ùêó‚Åª, ùêï, dist), 2 ) # Similarly for M_k^-
    sum(abs(meanp - meann)) # Eq (7), sum is from k=1 to K
end

### VERSIONS TAKING IN THE DATA SET AND PROTOTYPES
# Optionally a distance measure function can be passed as an argument

function Lz{T<:Number}(ùêó‚Å∫::Matrix{T}, ùêó‚Åª::Matrix{T}, ùêï::Matrix{T})
    return Lz(softmax_euclidean_par(ùêó‚Å∫, ùêï), softmax_euclidean_par(ùêó‚Åª, ùêï))
end

function Lz{T<:Number}(ùêó‚Å∫::Matrix{T}, ùêó‚Åª::Matrix{T}, ùêï::Matrix{T}, dist::Function)
    return Lz(softmax_dist_par(ùêó‚Å∫, ùêï, dist), softmax_dist_par(ùêó‚Åª, ùêï, dist))
end

### VERSION FOR PRECALCULATED ùêô‚Å∫ and ùêô‚Åª
# Note we have only a version for matrices. This is because during performance
# testing I noticed that
#
#   ZZZp = sdata(ùêôshared‚Å∫)
#   ZZZn = sdata(ùêôshared‚Åª)
#   Lz(ZZZp, ZZZn)
#
# is faster than
#
#   Lz(ùêôshared‚Å∫, ùêôshared‚Åª)
#
# So use the Matrix version always and if necessary lift the matrices out
# of the SharedArray with sdata().
#
# TODO: is there way to make a faster parallel version?

function Lz{T<:Number}(ùêô‚Å∫::Matrix{T}, ùêô‚Åª::Matrix{T})
    # Operate on matrices and take mean from sample dimension N
    meanp = mean( ùêô‚Å∫, 2 ) # Eq (6)
    meann = mean( ùêô‚Åª, 2 ) # Similarly for M_k^-
    sum(abs(meanp - meann)) # Eq (7), sum is from k=1 to K
end

Lz (generic function with 4 methods)

In [42]:
# Test
LZtrain = Lz(ùêó‚Å∫train, ùêó‚Åªtrain, ùêï)

0.17883943938377728

### $L_x$ &mdash; information loss
In code, we will denote $L_x$ with function `Lx`.

In [14]:
# Symbols: ùêó ‚Å∫ ‚Åª ‚àë ùêï ùê± ùê≤

Note .* elementwise multiplication of softmax_dist() and V, there is no \cdot in the paper in Eq (9), dot product would return a scalar.

In [48]:
### NAIVE VERSION FOR UNDERSTANDING THE IMPLEMENTATION

function LxNaive{T<:Number,U<:Number}(ùêó::Matrix{T}, ùêï::Matrix{U}, dist::Function)
    D = size(ùêó, 1)
    N = size(ùêó, 2)
    K = size(ùêï, 2)
    ùêóhat = zeros(Float64, (D,N))
    sum = Float64(0.0)
    for n in 1:N
        for k in 1:K # Eq (9)
            ùêóhat[:,n] = ùêóhat[:,n] + (softmax_dist(ùêó[:,n], ùêï, dist) .* ùêï[:,k])
        end
        sum += (ùêó[:,n] - ùêóhat[:,n]) ‚ãÖ (ùêó[:,n] - ùêóhat[:,n]) # Eq (8)
    end
    return sum, ùêóhat
end

### VERSIONS TAKING IN THE DATA SET AND PROTOTYPES
# Optionally a distance measure function can be passed as an argument

function Lx{T<:Number,U<:Number}(ùêó::Matrix{T}, ùêï::Matrix{U})
    return Lx(softmax_euclidean_par(ùêó, ùêï), ùêó, ùêï)
end

function Lx{T<:Number,U<:Number}(ùêó::Matrix{T}, ùêï::Matrix{U}, dist::Function)
    return Lx(softmax_dist_par(ùêó, ùêï, dist), ùêó, ùêï)
end

### VERSION FOR PRECALCULATED ùêô

function Lx{T<:Number}(ùêô::SharedArray{T,2}, ùêó::Matrix{T}, ùêï::Matrix{T})
    D = size(ùêï, 1)
    K = size(ùêï, 2)
    N = size(ùêô, 2)
    # Keep ùêô as SharedArray, will be faster than taking sdata() when fed to the following @parallel loop
    sum = @parallel (+) for n in 1:N # Eq (8)
        ùê±hat_n = zeros(Float64, D)
        for k in 1:K # Eq (9)
            ùê±hat_n = ùê±hat_n + (ùêô[:,n] .* ùêï[:,k]) # We are constructing a vector of length D
        end
        (ùêó[:,n] - ùê±hat_n) ‚ãÖ (ùêó[:,n] - ùê±hat_n) # "simple squared error"
    end
    return sum
end

# TODO: test if making V into sharedarray increases performance, probably not since V is usually small

Lx (generic function with 3 methods)

In [52]:
# Test
LXtrain = Lx(ùêótrain, ùêï)

182684.39602519403

### $L_y$ &mdash; prediction accuracy
In code, we will denote $L_y$ with function `Ly`.

Essentially here we are letting the optimization pick both the prototypes (i.e. feature vectors) and their predictions (i.e. labels), and the predictions don't have to be discrete 0 and 1, but can be from the range $[0,1]$ and thus themselves can be viewed as probabilities. E.g. let's say that for prototype $v_k$ it's prediction $w_k = 0.82$, then "there is a 82% chance prototype $\mathbf{v}_k$ gets label 1"

In [17]:
### NAIVE VERSION TO HELP UNDERSTAND THE IMPLEMENTATION

function LyNaive{T1<:Number,T2<:Number,T3<:Number}(
        ùêó::Matrix{T1}, ùêï::Matrix{T1}, ùê≤::Vector{T2}, ùê∞::Vector{T3}, dist::Function
    )
    D = size(ùêó, 1)
    N = size(ùêó, 2)
    K = size(ùêï, 2)
    ùê≤hat = zeros(Float64, N)
    sum = Float64(0.0)
    # Replace ùê≤hat in Eq (10) with Eq (11), then you get this for loop
    for n in 1:N
        ùêô_n = softmax_dist(ùêó[:,n], ùêï, dist) # Vector of length K
        for k in 1:K
            ùê≤hat[n] = ùê≤hat[n] + (ùêô_n[k] * ùê∞[k])
        end
        # The following line could be replaced with
        # if ùê≤[n] == 1
        #    sum -= log(ùê≤hat[n])
        # else # ùê≤[n] == 0
        #    sum -= log(1 - ùê≤hat[n])
        # end
        sum += -ùê≤[n] * log(ùê≤hat[n])  -  (1 - ùê≤[n]) * log(1 - ùê≤hat[n])
    end
    #return sum, ùê≤hat
    return sum
end

### VERSIONS TAKING IN THE DATA SET AND PROTOTYPES
# Optionally a distance measure function can be passed as an argument

function Ly{T1<:Number,T2<:Number,T3<:Number}(
        ùêó::Matrix{T1}, ùêï::Matrix{T1}, ùê≤::Vector{T2}, ùê∞::Vector{T3}
    )
    # ùêô = softmax_euclidean_par(ùêó, ùêï)
    return Ly(softmax_euclidean_par(ùêó, ùêï), ùê≤, ùê∞)
end

function Ly{T1<:Number,T2<:Number,T3<:Number}(
        ùêó::Matrix{T1}, ùêï::Matrix{T1}, ùê≤::Vector{T2}, ùê∞::Vector{T3}, dist::Function
    )
    # ùêô = softmax_dist_par(ùêó, ùêï, dist)
    return Ly(softmax_dist_par(ùêó, ùêï, dist), ùê≤, ùê∞)
end

### VERSION FOR PRECALCULATED ùêô

# function Ly{T1<:Number,T2<:Number,T3<:Number}(
#         ùêô::Matrix{T1}, ùê≤::Vector{T2}, ùê∞::Vector{T3}
#     )
#     # Copy to shared memory
#     ùêôshared = convert(SharedArray{T1, 2}, ùêô)
#     return Ly(ùêôshared, ùê≤, ùê∞)
# end

function Ly{T1<:Number,T2<:Number,T3<:Number}(
        ùêô::SharedArray{T1,2}, ùê≤::Vector{T2}, ùê∞::Vector{T3}
    )
    N = size(ùêô, 2)
    # Keep ùêô as SharedArray, will be faster than taking sdata() when fed to the following @parallel loop
    sum = @parallel (+) for n in 1:N # Eq (10)
        yhat_n = ùêô[:,n] ‚ãÖ ùê∞ # Eq (11)
        - ùê≤[n] * log(yhat_n) - (1 - ùê≤[n]) * log(1 - yhat_n)
    end
    return sum
end

Ly (generic function with 4 methods)

In [18]:
# Test
LYtrain = Ly(ùêótrain, ùêï, ùê≤train, ùê∞)

26076.52644269116

## Optimization objective function

In [100]:
# Overall objective function
objective_euclidean(ùêó, ùêó‚Å∫, ùêó‚Åª, ùêï, ùê≤, ùê∞, A) = A[:z]*Lz(ùêó‚Å∫, ùêó‚Åª, ùêï) + A[:x]*Lx(ùêó, ùêï) + A[:y]*Ly(ùêó, ùêï, ùê≤, ùê∞)
objective_dist(ùêó, ùêó‚Å∫, ùêó‚Åª, ùêï, ùê≤, ùê∞, A, dist) = A[:z]*Lz(ùêó‚Å∫, ùêó‚Åª, ùêï, dist) + A[:x]*Lx(ùêó, ùêï, dist) + A[:y]*Ly(ùêó, ùêï, ùê≤, ùê∞, dist)
function objective_pre(ùêó::Matrix, S::Vector{Bool}, ùêï::Matrix, ùê≤::Vector, ùê∞::Vector, A::Dict)
    # Calculate ùêô and partitions once
    ùêô = softmax_euclidean_par(ùêó, ùêï)
    (ùêô‚Å∫, ùêô‚Åª) = partition(ùêô, S)
    # Use functions that accept precalculated ùêô
    return A[:z]*Lz(ùêô‚Å∫, ùêô‚Åª) + A[:x]*Lx(ùêô, ùêó, ùêï) + A[:y]*Ly(ùêô, ùê≤, ùê∞)
end

objective_pre (generic function with 3 methods)

In [93]:
objective_euclidean(ùêótrain, ùêó‚Å∫train, ùêó‚Åªtrain, ùêï, ùê≤train, ùê∞, A)

117418.72624368257

In [101]:
objective_pre(ùêótrain, S_ùêótrain, ùêï, ùê≤train, ùê∞, A)

117418.72624368257

In [102]:
# @time for i in 1:10
#     objective_euclidean(ùêótrain, ùêó‚Å∫train, ùêó‚Åªtrain, ùêï, ùê≤train, ùê∞, A)
# end
# @time for i in 1:10
#     objective_pre(ùêótrain, S_ùêótrain, ùêï, ùê≤train, ùê∞, A)
# end
#4.119981 seconds (1.24 M allocations: 95.438 MB, 0.42% gc time)
#2.420592 seconds (599.02 k allocations: 87.681 MB, 0.72% gc time)

  2.315172 seconds (596.71 k allocations: 87.516 MB, 0.69% gc time)


In [21]:
objective_euclidean(ùêótest, ùêó‚Å∫test, ùêó‚Åªtest, ùêï, ùê≤test, ùê∞, A)

1489.5333648228848

## Test optimization run

In [None]:
# Call L-BFGS

# Optimization, running the algorithm
Hyperparameters for the objective function.
In the paper they use grid search to find the parameters. The sets defined here are the same as in the paper.

In [22]:
# Sets of hyperparameters as in paper, for grid search
gridA = Dict(:z => Set([0.1, 0.5, 1.0, 5.0, 10.0]), :x => Set([0, 0.01]), :y => Set([0.1, 0.5, 1.0, 5.0, 10.0]))
# An example of selected hyperparameters, for development
A = Dict(:z => 0.01, :x => 0.5, :y => 1.0)

Dict{Symbol,Float64} with 3 entries:
  :y => 1.0
  :z => 0.01
  :x => 0.5

# TODO
- Overall process with pictures

# Problems/Cons/Notes:
- Let's say that there is a column/feature "Religion" in the dataset.
- Now this paper says we can only say that "Is a member of protected group" or "Is not a member of protected group".
- You have to decide what is the "protected" group, and what is the "normal/non protected" group. You have to decide based on some external criteria who are discriminated against and who are not.
- Let's say we have a dataset with a feature "Religion" and we have 5 different religions represented.
- Now we have to choose which ones are protected and which ones are not.
- The problem of course is that some might be in general discriminated against more than others. There is not necessary even split between the different groups that are discriminated against.

- What we would like to say is that "Religion" is a sensitive feature, and we should not infer _anything_ from it, regardless what it is.

- Does running the algo multiple times help, changing the binary classification each time? Can we extend it so that $S \in {1,...,C}$ where $C$ is the number of categories in the sensitive column.
  - We can extend, just split $L_z$ to multiple cases and the optimization is done to all of them. There will be $c = \frac{(C-1)C}{2} \approx O(C^2)$ pairs. Whether this is computationally still feasible is another question. In the objective function $L_z$ is replaced by $A_{z_1} \cdot L_{z_1} + A_{z_2} \cdot L_{z_2} + \dots + A_{z_c} \cdot L_{z_c}$.

- On the current case where $S \in \left\{0,1\right\}$ once we have set for which rows $S=1$ and $S=0$, we can flip them around without changing anything. This is because we are using statistical parity. This means that from the algorithm's perspective saying that group0 is non-protected and groups 1..4 are protected is the same as saying group1 is protected and other non-protected.