# Example: Binary Discrete Choice Models as a Neural Network
Both observable and unobservable features can influence an individual's decisions. A utility function that accounts for both kinds of factors takes the form:

\begin{equation}
U_{ij} = V_{ij} + \epsilon_{ij}
\end{equation}

The term $U_{ij}$ is the utility of alternative $j$ for individual $i$, $V_{ij}$ is the _deterministic_ component of the utility, 
i.e., the utility associated with the observable features, and $\epsilon_{ij}$ is the random component of the utility (error model). 

### Logit choice model
One of the most common choice models is the `Logit` model. If the random components $\epsilon_{ij}$ of the utility $U_{ij}$ are _independently and identically distributed_ (IID) across the $J$ alternatives and follow a [Gumbel distribution](https://en.wikipedia.org/wiki/Gumbel_distribution), 
then the probability that individual $i$ chooses alternative $j$ is given by the [logit choice model](https://en.wikipedia.org/wiki/Discrete_choice):

\begin{equation}
P_{ij} = \frac{e^{V_{ij}/\mu}}{\displaystyle \sum_{k=1}^{J}e^{V_{ik}/\mu}}\qquad{j=1,\dotsc,J}
\end{equation}

where $P_{ij}$ is the probability that individual $i$ chooses alternative $j$, $V_{ij}$ is the deterministic component of the utility, and $\mu$ is a scale parameter.
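
The logit formula above can be sketched in a few lines of plain Julia. The function name `logit_probabilities`, the helper `logistic`, and the utility values are illustrative assumptions, not part of the course package. The sketch also checks that, for two alternatives, the logit probability reduces to the logistic function of the utility difference, which is why a sigmoid output unit is a natural fit for a binary choice.

```julia
# Logit choice probabilities for one individual (illustrative sketch).
# V: deterministic utilities of the J alternatives, μ: scale parameter.
function logit_probabilities(V::AbstractVector{<:Real}; μ::Real = 1.0)
    z = V ./ μ
    w = exp.(z .- maximum(z))   # subtracting the max leaves the ratios unchanged and avoids overflow
    return w ./ sum(w)
end

# logistic (sigmoid) function, used to check the binary special case
logistic(x) = 1 / (1 + exp(-x))

V = [1.2, 0.7]                  # made-up utilities for two alternatives
P = logit_probabilities(V)      # P[j] is the probability of choosing alternative j
P_binary = logistic(V[1] - V[2])
```

With $\mu = 1$ and $J = 2$, `P[1]` and `P_binary` agree up to floating-point error.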

### Learning Objectives
In this example, our goal is to calculate the probability of a binary choice between purchasing a Tesla Model S or a Honda Odyssey using an `Artificial Neural Network (ANN)`. We will use the `Logit` choice model to generate training data, which we'll then use to train the `ANN`.

- Introducing students to Random Utility Models (RUMs) and the `Logit` discrete choice model
- Familiarizing students with `Bernoulli` random variables and how to simulate binary choices
- Teaching students how to simulate the `Logit` discrete choice model directly by sampling the `Gumbel` distribution.

By the end of this exercise, students should have a better understanding of these concepts and be able to apply them in real-world scenarios.
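
As a rough illustration of the last two objectives, the sketch below simulates a binary choice in two ways: by adding standard Gumbel noise to each deterministic utility and picking the larger total utility, and by drawing a Bernoulli outcome from the logit probability. The utilities, sample size, seed, and helper names are assumptions chosen for illustration; only the `Random` standard library is used.

```julia
using Random

# Standard Gumbel draw by inverse-transform sampling:
# if u ~ Uniform(0,1), then -log(-log(u)) follows a Gumbel(0,1) distribution
sample_gumbel(rng::AbstractRNG) = -log(-log(rand(rng)))

# Fraction of N simulated individuals who choose alternative 1,
# i.e. for whom V[1] + ε₁ exceeds V[2] + ε₂
function simulate_choice_frequency(rng::AbstractRNG, V::Vector{Float64}, N::Int)
    count_first = 0
    for _ in 1:N
        U1 = V[1] + sample_gumbel(rng)
        U2 = V[2] + sample_gumbel(rng)
        count_first += (U1 > U2) ? 1 : 0
    end
    return count_first / N
end

rng = MersenneTwister(1234)
V = [1.2, 0.7]                                   # made-up deterministic utilities
N = 100_000
p_logit = exp(V[1]) / (exp(V[1]) + exp(V[2]))    # logit probability with μ = 1
p_simulated = simulate_choice_frequency(rng, V, N)

# Bernoulli view: each choice is a biased coin flip with success probability p_logit
p_bernoulli = count(_ -> rand(rng) < p_logit, 1:N) / N
```

Both simulated frequencies should approach `p_logit` as `N` grows.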

## Setup

In [1]:
include("Include.jl");

[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDecisionsPackage.jl.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Manifest.toml`
[32m[1m  Activating[22m[39m project at `~/Desktop/julia_work/CHEME-5760-Examples-F23`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDecisionsPackage.jl.git`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Manifest.toml`


## Data
The dataset we explore will be random perturbations of the Tesla versus Odyssey survey presented in class (in example `L2b`). 
* We load this dataset into the notebook using the `HondaTeslaDataSet()` function, which returns a [DataFrame](https://dataframes.juliadata.org/stable/) that we store in the `dataset` variable.
* We'll then randomly perturb the values of the dataset to generate a `training` collection and a `testing` collection.

In [2]:
dataset = HondaTeslaDataSet()

| Row | feature | exponent | Tesla | Honda |
|---|---|---|---|---|
|  | String15 | Float64 | Float64 | Float64 |
| 1 | sustainability | 0.2 | 5.0 | 3.0 |
| 2 | affordability | 0.1 | 2.0 | 4.0 |
| 3 | styling | 0.05 | 5.0 | 2.0 |
| 4 | usefulness | 0.3 | 2.0 | 5.0 |
| 5 | costownership | 0.1 | 4.0 | 2.0 |
| 6 | performance | 0.05 | 5.0 | 1.0 |
| 7 | safety | 0.2 | 5.0 | 5.0 |


In [3]:
exponent_distribution = Dirichlet(dataset[!,:exponent])

Dirichlet{Float64, Vector{Float64}, Float64}(alpha=[0.2, 0.1, 0.05, 0.3, 0.1, 0.05, 0.2])

In [4]:
number_of_traning_samples = 1000
number_of_features = 7
training_example_dictionary = Dict{Int64, DataFrame}();
for i ∈ 1:number_of_traning_samples
    
    # initialize a blank DataFrame -
    df = DataFrame();
    
    # generate an exponent vector -
    α = rand(exponent_distribution) .|> x-> round(x, digits=4)
    tesla_scores = rand(1.0:5.0,number_of_features);
    honda_scores = rand(1.0:5.0,number_of_features);
    
    for j ∈ 1:number_of_features
        row_data = (
            feature = dataset[j,:feature],
            exponent = α[j],
            Tesla = tesla_scores[j],
            Honda = honda_scores[j]
        );
        push!(df, row_data);
    end
    
    # store the training example -
    training_example_dictionary[i] = df;
end

In [5]:
training_example_dictionary[600]

| Row | feature | exponent | Tesla | Honda |
|---|---|---|---|---|
|  | String15 | Float64 | Float64 | Float64 |
| 1 | sustainability | 0.326 | 4.0 | 4.0 |
| 2 | affordability | 0.6706 | 4.0 | 1.0 |
| 3 | styling | 0.0031 | 1.0 | 1.0 |
| 4 | usefulness | 0.0 | 4.0 | 5.0 |
| 5 | costownership | 0.0002 | 1.0 | 1.0 |
| 6 | performance | 0.0 | 2.0 | 1.0 |
| 7 | safety | 0.0 | 1.0 | 5.0 |


In [6]:
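P_training_array = Array{Float64,1}();
for i ∈ 1:number_of_traning_samples
    # use a loop-specific name so the survey `dataset` loaded above is not overwritten
    example = training_example_dictionary[i];
    
    # build the utility model from this example's exponents -
    model = build(VLLogTransformedCobbDouglasUtilityFunction, (
        α = example[:,:exponent], b = ℯ)
    );
    
    # deterministic utilities: V[1] is Tesla, V[2] is Honda
    V = zeros(2);
    V[1] = model(example[:,:Tesla]);
    V[2] = model(example[:,:Honda]);
    
    # compute P, the logit probability of choosing Tesla (scale μ = 1) -
    p = exp(V[1])/(exp(V[1])+exp(V[2]))
    
    # capture -
    push!(P_training_array, p)
end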
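
To make the labeling step concrete, here is a minimal sketch that reproduces the calculation by hand for the original survey table. It assumes the log-transformed Cobb-Douglas utility with base $e$ takes the form $V(x) = \sum_{k}\alpha_{k}\ln x_{k}$; this is an assumption about what `VLLogTransformedCobbDouglasUtilityFunction` evaluates, and the stand-in function `log_cobb_douglas` is hypothetical, not the package implementation.

```julia
# Hypothetical stand-in for the package model: V(x) = Σₖ αₖ·log(xₖ)
log_cobb_douglas(α, x) = sum(α .* log.(x))

# values copied from the Tesla versus Honda survey table
α     = [0.2, 0.1, 0.05, 0.3, 0.1, 0.05, 0.2]
tesla = [5.0, 2.0, 5.0, 2.0, 4.0, 5.0, 5.0]
honda = [3.0, 4.0, 2.0, 5.0, 2.0, 1.0, 5.0]

V_tesla = log_cobb_douglas(α, tesla)
V_honda = log_cobb_douglas(α, honda)

# binary logit with μ = 1, written in terms of the utility difference
p_tesla = 1 / (1 + exp(V_honda - V_tesla))
```

Under this assumed functional form, `p_tesla` is the probability label that would be attached to the unperturbed survey.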

## Build/Train the ANN decision agent

In [7]:
# initialize storage for labeled data for training -
training_data_vector = Vector{Tuple{Vector{Float32},Float32}}();
for i ∈ 1:number_of_traning_samples
    
    # grab the training example (loop-specific name, so `dataset` is not overwritten)
    example = training_example_dictionary[i];
    
    # pull the numerical data out, flatten column by column, and convert to Float32:
    # the input holds the 7 exponents, then the 7 Tesla scores, then the 7 Honda scores
    input_data_vector = Float32.(vec(Matrix(example[!,[:exponent,:Tesla,:Honda]])))
    y = convert(Float32, P_training_array[i]); # output: logit probability of choosing Tesla
    
    # package -
    data_tuple = (
        input_data_vector, y
    );
    push!(training_data_vector, data_tuple);
end
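
Because Julia stores matrices in column-major order, flattening the three numeric columns produces all exponents first, then all Tesla scores, then all Honda scores. The toy 3-feature matrix below (made-up values) is a small sketch of that ordering.

```julia
# Toy example with 3 features: columns are exponent, Tesla score, Honda score
M = [0.5 4.0 1.0;
     0.3 2.0 5.0;
     0.2 3.0 3.0]

# vec flattens column by column; the broadcasted Float32 call converts each element
x = Float32.(vec(M))
```

Here `x` has length 9, and its first three entries are the exponent column.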

In [8]:
# build the model -
input_dimension = number_of_features*3;
#FFN_binary_choice = Chain(Dense(input_dimension, 10, σ), Dense(10, 1, σ));
FFN_binary_choice = Chain(Dense(input_dimension, 10, σ), Dense(10, 8, σ), Dense(8, 2, σ), Dense(2, 1, σ));
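
For intuition about what the `Chain` of `Dense` layers computes, the following sketch writes out a forward pass by hand: each layer applies $a \leftarrow \sigma(Wa + b)$. It uses randomly initialized weights and hypothetical helper names (`sigmoid`, `dense`, `forward`), and it is not how Flux implements its layers internally.

```julia
using Random

sigmoid(z) = 1 / (1 + exp(-z))          # logistic activation

# one dense layer: elementwise sigmoid of an affine map
dense(W, b, x) = sigmoid.(W * x .+ b)

rng = MersenneTwister(1)
sizes = [21, 10, 8, 2, 1]               # layer widths: 21 inputs, then 10, 8, 2 and 1 units
Ws = [0.1f0 .* randn(rng, Float32, sizes[k+1], sizes[k]) for k in 1:length(sizes)-1]
bs = [zeros(Float32, sizes[k+1]) for k in 1:length(sizes)-1]

# apply the layers one after another
function forward(Ws, bs, x)
    a = x
    for (W, b) in zip(Ws, bs)
        a = dense(W, b, a)
    end
    return a
end

x = rand(rng, Float32, 21)              # made-up input vector
y_hat = forward(Ws, bs, x)              # 1-element vector with a value in (0, 1)
n_parameters = sum(length, Ws) + sum(length, bs)
```

The final sigmoid keeps the output between 0 and 1, so it can be read as a choice probability.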

In [9]:
# setup a loss function -
loss(x, y) = Flux.Losses.mse(FFN_binary_choice(x), y; agg = mean)

loss (generic function with 1 method)

In [10]:
# pointer to params -
θ = Flux.params(FFN_binary_choice)

Params([Float32[-0.15398169 -0.27920803 … 0.3063056 -0.25972453; -0.28658324 0.39146098 … -0.3077912 -0.16072471; … ; 0.33179227 0.29942274 … 0.07458144 0.280759; -0.106863074 -0.33519754 … 0.35353678 -0.0059852963], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.20281586 -0.13127875 … -0.31848973 0.39112967; 0.45615515 0.17447606 … -0.025403 -0.15856344], Float32[0.0, 0.0], Float32[-0.9686101 -0.0077899178], Float32[0.0]])

Next, let's specify the optimization approach that we'll use to estimate the unknown model parameters $\theta$. In particular, we'll use the [Momentum gradient descent algorithm](https://optimization.cbe.cornell.edu/index.php?title=Momentum): 
> Momentum is an extension to the gradient descent optimization algorithm that allows the search to build inertia in a direction in the search space and overcome the oscillations of noisy gradients and coast across flat spots of the search space

In [11]:
λ = 0.10;  # learning rate
β = 0.95; # momentum parameter
opt = Momentum(λ, β);
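
The idea behind momentum can be sketched on a one-dimensional toy objective, $f(x) = (x-3)^{2}$. The update below is one common form of the rule (a velocity that accumulates a decaying sum of past gradients); it is an illustrative assumption, not a copy of the Flux implementation, and the function names are hypothetical.

```julia
# gradient of the toy objective f(x) = (x - 3)^2 (not the network loss)
grad_f(x) = 2 * (x - 3)

function momentum_descent(grad, x0; λ = 0.10, β = 0.95, steps = 500)
    x, v = x0, 0.0
    for _ in 1:steps
        v = β * v + λ * grad(x)   # velocity: decaying sum of past gradients
        x = x - v                 # step along the accumulated direction
    end
    return x
end

x_min = momentum_descent(grad_f, 0.0)   # should approach the minimizer x = 3
```

With `β = 0` this reduces to plain gradient descent; larger `β` lets earlier gradients keep influencing the step.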

An `epoch` is one complete pass through the training data; the `number_of_epochs` variable sets how many passes we make. To run the gradient descent estimation algorithm, we'll call the `train!(...)` function exported by the [Flux.jl](https://fluxml.ai) package:

In [12]:
number_of_epochs = 1000;
for i = 1:number_of_epochs
    Flux.train!(loss, θ, training_data_vector, opt)
end

### What is the correct classification rate?

In [13]:
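# score each sample: 1 if the predicted choice matches the logit choice, 0 otherwise
# note: this loop evaluates the model on the training samples themselves
score_array = Array{Int,1}();
for i ∈ 1:number_of_traning_samples
    x = training_data_vector[i][1]; # input features
    y = training_data_vector[i][2]; # logit probability of choosing Tesla
    ŷ = FFN_binary_choice(x)[1]     # ANN estimate of that probability
    
    # threshold both probabilities at 0.5: 'T' = Tesla, 'H' = Honda
    correct_choice = y > 0.50 ? 'T' : 'H'
    predicted_choice = ŷ > 0.50 ? 'T' : 'H'
    if (correct_choice == predicted_choice)
        push!(score_array,1)
    else
        push!(score_array,0)
    end 
end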

In [14]:
freq_correct = sum(score_array) / number_of_traning_samples

0.794
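
The rate above is computed on the same samples that were used for training. A small helper such as the one sketched below could score any labeled collection, including a separately generated testing collection. The function name `classification_rate` and the example values are hypothetical.

```julia
# Fraction of samples where the thresholded prediction matches the thresholded label
function classification_rate(y::AbstractVector{<:Real}, y_hat::AbstractVector{<:Real}; threshold = 0.5)
    length(y) == length(y_hat) || throw(ArgumentError("y and y_hat must have the same length"))
    matches = count(i -> (y[i] > threshold) == (y_hat[i] > threshold), eachindex(y))
    return matches / length(y)
end

# made-up labels and predictions, for illustration only
y     = [0.9, 0.2, 0.6, 0.4]
y_hat = [0.8, 0.3, 0.4, 0.1]
rate = classification_rate(y, y_hat)
```

In this toy case three of the four thresholded predictions agree with the labels.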