# Example: Binary Discrete Choice Models as a Neural Network
Both observable and unobservable features can influence an individual's decisions. A utility function that accounts for both kinds of factors takes the form:

\begin{equation}
U_{ij} = V_{ij} + \epsilon_{ij}
\end{equation}

The term $U_{ij}$ is the utility of alternative $j$ for individual $i$, $V_{ij}$ is the _deterministic_ component of the utility, 
i.e., the utility associated with the observable features, and $\epsilon_{ij}$ is the random component of the utility (the error model), which accounts for the unobservable features.

### Logit choice model
One of the most common choice models is the `Logit` model. If the random components $\epsilon_{ij}$ of the utility $U_{ij}$ are _independently and identically distributed_ (IID) across the $J$ alternatives and follow a [Gumbel distribution](https://en.wikipedia.org/wiki/Gumbel_distribution), 
then the probability that individual $i$ chooses alternative $j$ is given by the [logit choice model](https://en.wikipedia.org/wiki/Discrete_choice):

\begin{equation}
P_{ij} = \frac{e^{V_{ij}/\mu}}{\displaystyle \sum_{k=1}^{J}e^{V_{ik}/\mu}}\qquad{j=1,\dotsc,J}
\end{equation}

where $P_{ij}$ is the probability that individual $i$ chooses alternative $j$, $V_{ij}$ is the deterministic component of the utility, and $\mu$ is a scale parameter.
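
For the binary case considered in this example ($J = 2$), the expression above reduces to a logistic (sigmoid) function of the utility difference. As a short worked step, dividing the numerator and denominator by $e^{V_{i1}/\mu}$ gives:

\begin{equation}
P_{i1} = \frac{e^{V_{i1}/\mu}}{e^{V_{i1}/\mu}+e^{V_{i2}/\mu}} = \frac{1}{1+e^{-(V_{i1}-V_{i2})/\mu}},\qquad P_{i2} = 1 - P_{i1}
\end{equation}

Only the scaled utility difference $(V_{i1}-V_{i2})/\mu$ matters, so when the two utilities are equal each alternative is chosen with probability $1/2$.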

### Learning Objectives
In this example, our goal is to calculate the probability of a binary choice between purchasing a Tesla Model S or a Honda Odyssey using an `Artificial Neural Network (ANN)`. We will use the `Logit` choice model to generate training data and then train an `ANN` on that data.

- Introduce Random Utility Models (RUMs) and the `Logit` discrete choice model, and show how these models can be used to train a discrete choice `ANN`.
- Use the [Flux.jl](https://fluxml.ai) machine learning package (loaded by the `Include.jl` file) to build the neural network model and estimate the unknown parameter vector $\theta$.

## Setup

In [1]:
include("Include.jl");


## Data
The dataset we explore consists of randomized variants of the Tesla versus Odyssey survey presented in class (in example `L2b`). 
* We load the survey into the notebook using the `HondaTeslaDataSet()` function and store the result in the `dataset` variable, a [DataFrame type](https://dataframes.juliadata.org/stable/).
* We then generate a `training` collection by drawing a random exponent vector and random feature scores for each example, and train the model on this collection.

In [2]:
dataset = HondaTeslaDataSet()

| Row | feature (String15) | exponent (Float64) | Tesla (Float64) | Honda (Float64) |
|---|---|---|---|---|
| 1 | sustainability | 0.2 | 5.0 | 3.0 |
| 2 | affordability | 0.1 | 2.0 | 4.0 |
| 3 | styling | 0.05 | 5.0 | 2.0 |
| 4 | usefulness | 0.3 | 2.0 | 5.0 |
| 5 | costownership | 0.1 | 4.0 | 2.0 |
| 6 | performance | 0.05 | 5.0 | 1.0 |
| 7 | safety | 0.2 | 5.0 | 5.0 |


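For the exponents, we'll use a [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution), which is exported by the [Distributions.jl](https://github.com/JuliaStats/Distributions.jl) package. Each draw from this distribution is a vector of non-negative values that sum to one, and we use the survey exponents as its parameters: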

In [3]:
exponent_distribution = Dirichlet(dataset[!,:exponent]);
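
To make the sampling step concrete, here is a minimal sketch that draws one exponent vector and checks its basic properties. It assumes the survey exponents are typed in by hand (the `concentration` name is illustrative and not part of the notebook) and that `Distributions.jl` is installed:

```julia
using Distributions

# concentration parameters: the survey exponents from the table above (typed in by hand)
concentration = [0.2, 0.1, 0.05, 0.3, 0.1, 0.05, 0.2]
d = Dirichlet(concentration)

# each draw is a length-7 vector of non-negative weights that sum to one
α = rand(d)
println(length(α))
println(sum(α) ≈ 1.0)
println(all(α .>= 0.0))

# the mean of the distribution is the normalized concentration vector
println(mean(d) ≈ concentration ./ sum(concentration))
```

Because every concentration parameter here is below one, individual draws tend to put most of their weight on a few features. Rounding each entry to four digits, as the next cell does, can also leave the sum slightly different from one.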

In [4]:
number_of_traning_samples = 1000
number_of_features = 7
training_example_dictionary = Dict{Int64, DataFrame}();
for i ∈ 1:number_of_traning_samples
    
    # initialize a blank DataFrame -
    df = DataFrame();
    
    # generate an exponent vector -
    α = rand(exponent_distribution) .|> x-> round(x, digits=4)
    tesla_scores = rand(1.0:5.0,number_of_features);
    honda_scores = rand(1.0:5.0,number_of_features);
    
    for j ∈ 1:number_of_features
        row_data = (
            feature = dataset[j,:feature],
            exponent = α[j],
            Tesla = tesla_scores[j],
            Honda = honda_scores[j]
        );
        push!(df, row_data);
    end
    
    # store the training example -
    training_example_dictionary[i] = df;
end

In [5]:
training_example_dictionary[600]

| Row | feature (String15) | exponent (Float64) | Tesla (Float64) | Honda (Float64) |
|---|---|---|---|---|
| 1 | sustainability | 0.0036 | 5.0 | 1.0 |
| 2 | affordability | 0.4881 | 5.0 | 5.0 |
| 3 | styling | 0.0135 | 4.0 | 4.0 |
| 4 | usefulness | 0.0013 | 5.0 | 4.0 |
| 5 | costownership | 0.0072 | 5.0 | 1.0 |
| 6 | performance | 0.0001 | 4.0 | 1.0 |
| 7 | safety | 0.4861 | 2.0 | 4.0 |


In [6]:
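P_training_array = Array{Float64,1}();
for i ∈ 1:number_of_traning_samples
    # use a loop-specific name so the survey stored in `dataset` is not overwritten
    example = training_example_dictionary[i];
    
    # utility model parameterized by this example's exponents
    model = build(VLLogTransformedCobbDouglasUtilityFunction, (
        α = example[:,:exponent], b = ℯ)
    );
    
    # deterministic utilities: V[1] is the Tesla, V[2] is the Honda
    V = zeros(2);
    V[1] = model(example[:,:Tesla]);
    V[2] = model(example[:,:Honda]);
    
    # binary logit probability of choosing the Tesla (scale parameter μ = 1)
    p = exp(V[1])/(exp(V[1])+exp(V[2]))
    
    # capture -
    push!(P_training_array, p)
end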

In [7]:
training_example_dictionary[600]

| Row | feature (String15) | exponent (Float64) | Tesla (Float64) | Honda (Float64) |
|---|---|---|---|---|
| 1 | sustainability | 0.0036 | 5.0 | 1.0 |
| 2 | affordability | 0.4881 | 5.0 | 5.0 |
| 3 | styling | 0.0135 | 4.0 | 4.0 |
| 4 | usefulness | 0.0013 | 5.0 | 4.0 |
| 5 | costownership | 0.0072 | 5.0 | 1.0 |
| 6 | performance | 0.0001 | 4.0 | 1.0 |
| 7 | safety | 0.4861 | 2.0 | 4.0 |


In [8]:
P_training_array[600]

0.42088822712511653
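
The probability for training example 600 can be checked by hand. The sketch below assumes the log-transformed Cobb-Douglas utility with base $e$ is $V = \sum_{k}\alpha_{k}\ln x_{k}$; this form is an assumption about the package, although it is consistent with the value printed above. The helper `log_cobb_douglas` is hypothetical and is not the package API:

```julia
# exponents and scores copied from training example 600 shown above
α     = [0.0036, 0.4881, 0.0135, 0.0013, 0.0072, 0.0001, 0.4861]
tesla = [5.0, 5.0, 4.0, 5.0, 5.0, 4.0, 2.0]
honda = [1.0, 5.0, 4.0, 4.0, 1.0, 1.0, 4.0]

# ASSUMPTION: log-transformed Cobb-Douglas utility, V = Σ αₖ ln(xₖ)
# (hypothetical helper, not the VLDecisionsPackage API)
log_cobb_douglas(α, x) = sum(α .* log.(x))

V_tesla = log_cobb_douglas(α, tesla)
V_honda = log_cobb_douglas(α, honda)

# binary logit with scale parameter μ = 1
p_tesla = exp(V_tesla) / (exp(V_tesla) + exp(V_honda))
println(round(p_tesla, digits = 4))  # 0.4209
```

In this example the Honda has the slightly higher deterministic utility, mostly because of the heavily weighted safety score, so the Tesla is chosen with probability below one half.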

## Build/Train the ANN decision agent
Now that we have `training data`, we can package it into the format required by the [Flux.jl](https://fluxml.ai) machine learning package (loaded by the `Include.jl` file). Each training example becomes a tuple holding a flattened input vector and the target probability:

In [9]:
# initialize storage for labeled data for training -
training_data_vector = Vector{Tuple{Vector{Float32},Float32}}();
for i ∈ 1:number_of_traning_samples
    
    # grab the i-th training example
    example = training_example_dictionary[i];
    
    # flatten the numerical columns (column-major: exponents, Tesla scores, Honda scores)
    # into a single Float32 input vector of length 3*number_of_features
    input_data_vector = example[!,[:exponent,:Tesla,:Honda]] |> Matrix |> vec .|> x->convert(Float32,x)
    y = P_training_array[i] |> x-> convert(Float32,x); # output: probability of choosing the Tesla
    
    # package -
    data_tuple = (
        input_data_vector, y
    );
    push!(training_data_vector, data_tuple);
end
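
The order of the entries in each input vector follows from Julia's column-major storage. As a small illustration, the sketch below flattens a made-up 3 × 3 matrix (a stand-in for one training example, not data from the notebook) and shows that the exponents come first, followed by the Tesla scores and then the Honda scores:

```julia
# a small stand-in for one training example: 3 features × 3 columns
# columns are (exponent, Tesla, Honda), as in the DataFrame above
M = [0.5 1.0 4.0;
     0.3 2.0 5.0;
     0.2 3.0 6.0]

# vec flattens column by column, the same order reshape(M, 9) produces
x = Float32.(vec(M))
println(x == Float32.(reshape(M, 9)))
println(x[1:3])   # exponents
println(x[4:6])   # Tesla scores
println(x[7:9])   # Honda scores
```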

In [10]:
# build the model -
input_dimension = number_of_features*3;
#FFN_binary_choice = Chain(Dense(input_dimension, 10, σ), Dense(10, 1, σ));
FFN_binary_choice = Chain(Dense(input_dimension, 10, σ), Dense(10, 8, σ), Dense(8, 2, σ), Dense(2, 1, σ));

We'll use a loss function $L(\theta)$ of the form:

$$
L\left(\theta\right) = \frac{1}{n}\sum_{i\in\mathcal{D}}(\hat{y}_{i}\left(\theta\right) - y_{i})^2
$$

where $\hat{y}_{i}\left(\theta\right)$ denotes the estimated output, $y_{i}$ denotes the measured output, and $n$ denotes the number of training examples supplied to the network:

In [11]:
# setup a loss function -
loss(x, y) = Flux.Losses.mse(FFN_binary_choice(x), y; agg = mean);
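
For reference, the mean squared error can be written out directly. This is a hand-rolled sketch of the formula above using made-up numbers, not the Flux source; the name `mse_by_hand` is illustrative:

```julia
using Statistics

# hand-rolled mean squared error; a sketch of the formula above, not Flux's source
mse_by_hand(ŷ, y) = mean((ŷ .- y) .^ 2)

ŷ = Float32[0.40, 0.70, 0.55]   # made-up predictions
y = Float32[0.42, 0.65, 0.50]   # made-up targets
println(mse_by_hand(ŷ, y))      # approximately 0.0018
```

Because each entry of `training_data_vector` holds a single example, each evaluation of `loss(x, y)` during training covers one example at a time.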

In [12]:
# pointer to params -
θ = Flux.params(FFN_binary_choice)

Params([Float32[0.37141433 0.18259825 … 0.023898233 0.39544383; -0.02074041 0.42798302 … 0.367587 0.12993084; … ; 0.11345285 0.40677235 … -0.03570399 -0.37187222; -0.10114692 -0.11433351 … 0.17433405 0.124350466], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.38386762 -0.04346634 … -0.34651995 -0.28798708; -0.39293945 -0.55088156 … -0.43565732 0.2458203; … ; 0.20226407 0.13953368 … 0.32450053 -0.1600008; -0.44402814 -0.11905761 … -0.4961206 0.3090702], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.63689643 0.09243866 … 0.397788 -0.46923953; 0.59172684 -0.43352294 … 0.080537535 0.037623666], Float32[0.0, 0.0], Float32[-0.96165955 0.5270939], Float32[0.0]])

Next, let's specify the optimization approach that we'll use to estimate the unknown model parameters $\theta$. In particular, we'll use the [Momentum gradient descent algorithm](https://optimization.cbe.cornell.edu/index.php?title=Momentum): 
> Momentum is an extension to the gradient descent optimization algorithm that allows the search to build inertia in a direction in the search space and overcome the oscillations of noisy gradients and coast across flat spots of the search space

In [13]:
λ = 0.10;  # learning rate
β = 0.95; # momentum parameter
opt = Momentum(λ, β);
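
To see what the momentum term does, here is a minimal sketch of one common form of the update (a velocity that accumulates past gradients) applied to a one-dimensional quadratic. The objective, the function name `momentum_descent`, and the step count are made up for illustration, and the details of the Flux implementation may differ:

```julia
# minimize f(θ) = (θ - 3)^2 with a classical momentum update
∇f(θ) = 2 * (θ - 3)

function momentum_descent(θ0; λ = 0.10, β = 0.95, steps = 1000)
    θ, v = θ0, 0.0
    for _ in 1:steps
        v = β * v - λ * ∇f(θ)   # velocity accumulates past gradients
        θ = θ + v               # step along the velocity
    end
    return θ
end

θ_final = momentum_descent(0.0)
println(abs(θ_final - 3) < 1e-6)
```

With $\beta$ close to one the iterate overshoots and oscillates around the minimizer before settling, which is the inertia described in the quote above.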

We'll specify the number of passes over the training data (each pass is called an `epoch`) in the `number_of_epochs` variable. To run the gradient descent estimation algorithm, we'll call the `train!(...)` function exported by the [Flux.jl](https://fluxml.ai) package:

In [14]:
number_of_epochs = 1000;
for i = 1:number_of_epochs
    Flux.train!(loss, θ, training_data_vector, opt)
end

### What is the correct classification rate?
Now that we have the updated model parameter vector $\theta$, we can run a forward prediction of the `choice model` by calling our model with an input vector. Let's calculate the `training choice` and the `prediction choice` for each training example, labeling a choice as Tesla (`T`) when the probability exceeds 0.50 and as Honda (`H`) otherwise, and compute the frequency of a correct prediction:

In [15]:
score_array = Array{Int,1}();
for i ∈ 1:number_of_traning_samples
    x = training_data_vector[i][1];
    y = training_data_vector[i][2];
    ŷ = FFN_binary_choice(x)[1]
    
    correct_choice = y > 0.50 ? 'T' : 'H'
    predicted_choice = ŷ > 0.50 ? 'T' : 'H'
    if (correct_choice == predicted_choice)
        push!(score_array,1)
    else
        push!(score_array,0)
    end 
end

In [16]:
freq_correct = findall(x->x==1,score_array) |> length |> x-> x/number_of_traning_samples

0.807
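
The same frequency can be computed in a vectorized way. The sketch below uses made-up probabilities (not values from the notebook) to show the thresholding logic on a small scale:

```julia
using Statistics

# made-up target and predicted Tesla probabilities (not taken from the notebook)
y = [0.42, 0.77, 0.51, 0.10]
ŷ = [0.45, 0.60, 0.49, 0.30]

# a prediction is correct when both values fall on the same side of 0.50
is_correct = (y .> 0.50) .== (ŷ .> 0.50)
println(mean(is_correct))   # 0.75
```

Note that the third pair counts as a miss even though the two probabilities differ by only 0.02, because they fall on opposite sides of the threshold.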