# L16b: Let's build a Deep Q-learning (DQN) Agent
In this lab, we'll look at a Deep Q-learning (DQN) agent whose objective is to learn to mix $K$ different materials to maximize the benefit of the mixture. 

### Tasks
Before you start, execute the `Run All Cells` command to check if you have any code or setup issues - let's get those fixed!

* __Task 1: Setup, Data, Constants__: In this task, we set up the computational environment, load the necessary packages, and prepare the `world(...)` function for our personal shopper problem. We will also define any constants we use throughout the problem set.
* __Task 2: Set up the Context, Main and Target Networks, and the Replay Buffer__: In this task, we build the models required by our deep Q-learning agent. These are the context model, which is an [instance of the `MyDQNworldContextModel` type](src/Types.jl) holding the parameters used by the `world(...)` function from Task 1, the main and target networks, the replay buffer, and the learning agent model.
* __Task 3: Watch the DQN agent in action__: In this task, we train the agent and then give it some test state vectors to see what it says.

Tests throughout the notebook (and at the bottom section) help you determine if things are running correctly. Let's go! (Remember to answer the discussion questions.)
___

## Task 1: Setup, Data, Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Otherwise, packages are downloaded and then loaded.

In [1]:
include("Include.jl");

Next, let's build the `world(...)` function. 
* The `world(...)` function takes the state vector `s::Array{Float32,1}`, whose element `s[i]` is the amount of component $i$ currently in the mixture, and the action vector `a::Array{Float32,1}`, whose element `a[i]` is the fractional change applied to component $i$. Both vectors have length $N$, the total number of components. Finally, the `world(...)` function takes a `context` model, which encapsulates the environment, including the budget constraint and the penalty for exceeding it. The function returns the next state and the reward (the utility). More on the `context` in `Task 2`.

We've assumed a _linear utility function_ for the personal shopper problem, where the utility is a linear combination of the items chosen minus a penalty for exceeding the budget. The utility function $U:\mathbb{R}^{N}\rightarrow\mathbb{R}$ is defined as follows:
$$
\begin{align*}
U_{\lambda}(\mathbf{n},\mathbf{\gamma}) = \sum_{i=1}^{N} \gamma_{i}\cdot n_i - \lambda \cdot \left[\max(0, \sum_{i=1}^{N} c_i \cdot n_{i} - B)\right]^{2}
\end{align*}
$$
where $\gamma_{i}$ is the marginal utility of option $i$ (unknown to the agent, known only to the world), $n_i$ denotes the amount of component $i$ in the mixture, $c_i$ is the unit cost of item $i$, $B$ is the budget, and $\lambda\geq{0}$ is the penalty weight. The quadratic penalty term is subtracted from the utility only if the total cost exceeds the budget.
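
To make the penalty term concrete, here is a minimal sketch of the deterministic utility for a two-component mixture. The function name `utility` and all of the numbers are made up for illustration; they are not part of the lab's code.

```julia
# Deterministic utility with a quadratic budget penalty (toy numbers, not from the lab).
function utility(n::Vector{Float64}, γ::Vector{Float64}, c::Vector{Float64};
                 B::Float64, λ::Float64)
    benefit = sum(γ .* n)                   # Σ γᵢ⋅nᵢ
    overspend = max(0.0, sum(c .* n) - B)   # amount by which the total cost exceeds the budget
    return benefit - λ * overspend^2
end

n = [2.0, 1.0];    # amounts of each component
γ = [1.0, -1.0];   # marginal utilities
c = [10.0, 20.0];  # unit costs, so the total cost is 40

U_within = utility(n, γ, c; B = 50.0, λ = 0.5);  # cost 40 ≤ 50: no penalty, U = 1
U_over   = utility(n, γ, c; B = 30.0, λ = 0.5);  # cost exceeds 30 by 10: U = 1 - 0.5⋅10² = -49
println((U_within, U_over))
```

Because the penalty grows with the square of the overspend, even a modest budget violation dominates the benefit term once $\lambda$ is large.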

__Hmmm__. Sometimes, we are uncertain about the benefit gained when we purchase good $i$, so let's add some randomness to the problem. In the presence of uncertainty, the utility function becomes:
$$
\begin{align*}
U_{\lambda}(\mathbf{n},\mathbf{\gamma}) = \sum_{i=1}^{N} (\gamma_{i} + \sigma_{i} \cdot Z_i) \cdot n_i - \lambda \cdot \left[\max(0, \sum_{i=1}^{N} c_i \cdot n_{i} - B)\right]^{2}
\end{align*}
$$
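where $Z_i \sim \mathcal{N}(0,1)$ is a random variable drawn from a standard normal distribution for each item $i$, and $\sigma_{i}\geq{0}$ denotes the strength of the uncertainty associated with good $i$ (a hyperparameter set by the shopper). This adds a stochastic element to the utility function, making it more realistic in scenarios where the benefits of purchasing items are uncertain. Note that the `world(...)` function defined in this task applies the noise slightly differently: it perturbs the amount $n_i$ in the benefit term and the unit cost $c_i$ in the budget term, using separate columns of `σ`.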
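
As a rough sketch of the noisy utility written exactly as in the equation above (not the lab's `world(...)` function), the snippet below draws one $Z_i$ per component. The name `noisy_utility`, the seed, and the toy numbers are assumptions for illustration.

```julia
using Random

# Noisy utility following the equation: each γᵢ is perturbed by σᵢ⋅Zᵢ with Zᵢ ~ N(0,1).
function noisy_utility(rng::AbstractRNG, n, γ, σ, c; B, λ)
    Z = randn(rng, length(n))               # one standard normal draw per component
    benefit = sum((γ .+ σ .* Z) .* n)
    overspend = max(0.0, sum(c .* n) - B)
    return benefit - λ * overspend^2
end

rng = MersenneTwister(1234);
n = [2.0, 1.0]; γ = [1.0, -1.0]; c = [10.0, 20.0];

# with σ = 0 the noise vanishes and we recover the deterministic utility
U_exact = noisy_utility(rng, n, γ, zeros(2), c; B = 50.0, λ = 0.5);

# with σ > 0 individual rewards scatter, but their average stays near the deterministic value
samples = [noisy_utility(rng, n, γ, [0.1, 0.1], c; B = 50.0, λ = 0.5) for _ in 1:10_000];
U_mean = sum(samples) / length(samples);
```

Since the noise has zero mean, averaging many noisy rewards approaches the noise-free utility, which is what lets a learning agent estimate values from repeated samples.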

In [2]:
function world(s::Array{Float32,1}, a::Array{Float32,1}, context::MyDQNworldContextModel)

    # initialize -
    γ = context.γ; # consumer preferences (unknown to agent)
    σ = context.σ; # noise in utility calculation (unknown to agent)
    B = context.B; # max budget (unknown to agent)
    C = context.C; # unit costs of goods (unknown to agent)
    λ = context.λ; # sensitivity to the budget
    Z = context.Z; # noise model
    number_of_goods = context.m; # number of goods (components in the mixture)

    # compute the reward for this choice -
    Ū = 0.0; # initial utility
    BC = 0.0; # initial budget constraint
    for i ∈ 1:number_of_goods
        
        nᵢ = s[i]; # amount of good i in the current state
        Cᵢ = C[i]; # cost of chosen good in category i
        γᵢ = γ[i]; # preference of good in category i
   
        # update the utility and the budget constraint -
        Ū += γᵢ*(nᵢ + σ[i,1]*rand(Z)); # compute the utility for this good, with noise. We'll use a linear utility model
        BC += nᵢ*(Cᵢ + σ[i,2]*rand(Z)); # compute the budget constraint -
    end

    # compute the utility with the budget constraint
    U = (Ū - λ*max(0.0, (BC - B))^2) .|> Float32 ; # use a penalty method to capture budget constraint

    # compute the next state -
    s′ = (s .+ (a.*s)) .|> Float32; # update the state with the action taken to get next state
    
    # return to caller -
    return s′, U; # return the next state and the reward
end;
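
The state update `s′ = s .+ (a .* s)` treats each entry of `a` as a relative change to the matching entry of `s`. The sketch below illustrates this with a hypothetical action encoding: it assumes odd action indices increase one component by a fraction `Δ` and even indices decrease it. The helper `action_vector` and that encoding are guesses for illustration only; the agent's actual encoding lives in the lab's source files.

```julia
# Hypothetical decoding of a discrete action index into a perturbation vector.
# Assumption: odd indices increase a component by the fraction Δ, even indices decrease it.
function action_vector(j::Int, K::Int; Δ::Float32 = 0.1f0)
    a = zeros(Float32, K)
    item = (j + 1) ÷ 2              # actions 1,2 -> component 1; actions 3,4 -> component 2; ...
    a[item] = isodd(j) ? Δ : -Δ
    return a
end

s = Float32[1.0, 2.0, 4.0];         # current amounts of three components
a = action_vector(3, 3);            # "increase component 2 by 10%"
s′ = s .+ (a .* s);                 # same update rule as in world(...)
println(s′)
```

One consequence of the multiplicative update is that a component whose amount is exactly zero stays at zero, whatever action is taken.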

__Constants__: Set constants we'll use in the subsequent tasks. See the comment beside the value for a description of what it is, its permissible values, etc.

In [3]:
K = 12; # TODO: Let's consider 12 different items that we need to mix together
number_of_actions = 2*K; # TODO: number of actions (2 for each item: increase or decrease)
number_of_hidden_states = 2^K; # TODO: number of hidden states (2^K is an arbitrary choice)
T = 2^14; # TODO: number of rounds for each decision task (should be ≥ 2^K)
budget = 100.0; # TODO: Budget for agent, assume 100 USD. We can change this later if we want
buffersize = 1000; # TODO: buffer size for the agent
B = 64; # TODO: minibatch size for the agent

## Task 2: Setup the Context, Main, Target Networks, and the Replay Buffer
In this task, we will build several models that are required for our deep Q-learning agent. We know that the agent will have a _main_ and a _target_ network, and we will also need a replay buffer to store the agent's experiences. 

### Context Model
The context model is an instance of [the `MyDQNworldContextModel` type](src/Types.jl). It holds various parameters that will be used in the `world(...)` function that we developed in Task 1. We save our context model instance in the `contextmodel::MyDQNworldContextModel` variable.

Let's walk through what we are saying here:

In [4]:
contextmodel = let

    # initialize -
    context = nothing; # initialize the context variable to nothing; this variable will be used to store the context model
    γ = Array{Float32,1}(undef, K); # consumer preferences (unknown to agent)
    σ = Array{Float32,2}(undef, K, 2); # noise in utility calculation (unknown to agent). First col is noise for good, second col is noise for price
    C = Array{Float32,1}(undef, K); # unit costs of goods (unknown to agent)
    Z = Normal(0,1); # use a standard normal distribution for the noise model; this can be changed to any distribution as required
    λ = 100000.0; # sensitivity to the budget constraint λ ≥ 0. If zero, then no penalty for budget constraint violation.

    # set the parameters -
    for i ∈ 1:K
        # Assign values for γ, σ, and C for each good in the context model.
        # For simplicity, preferences alternate in sign: even-indexed goods have preference +1, odd-indexed goods have preference -1.
        # This can be customized as per the requirement of the simulation
        γ[i] = iseven(i) ? 1.0 : -1.0; # if i is even, then γ[i] = 1.0, else γ[i] = -1.0
        σ[i,:] .= 0.1; # uniform uncertainty for all goods and prices, this can be adjusted based on the specific needs of the simulation
        C[i] = 10.0 + 10.0 * (i - 1); # linearly increasing costs for goods, this can be customized as per the requirement
    end

    # TODO: Uncomment the code below to build the context model -
    # build a context model with the required parameters -
    context = build(MyDQNworldContextModel, (
        γ = γ, # consumer preferences (unknown to agent)
        σ = σ, # noise in utility calculation (unknown to agent)
        B = budget, # max budget (unknown to agent). Note: the global `B` is the minibatch size, not the budget
        C = C, # unit costs of goods (unknown to agent)
        λ = λ, # sensitivity to the budget
        Z = Z, # noise model
        m = K, # number of components in the mixture
    )); # build the context

    # return the model -
    context;
end;

### Main and Target Networks
The main network is used to select actions, while the target network is used to evaluate the actions taken by the main network. These networks have the same architecture but are updated at different rates. The main network is updated more frequently, while the target network is updated less frequently to provide a stable target for the Q-value updates.
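
To illustrate what a "stable target" means, here is a minimal sketch of the standard DQN bootstrapped target $y = r + \gamma_{d}\cdot\max_{a^{\prime}}Q_{\text{target}}(s^{\prime},a^{\prime})$ for a single transition. The stand-in `q_target` (a fixed linear map), the discount factor `discount`, and the numbers are all assumptions for illustration; the lab's `learn(...)` function may compute its targets differently.

```julia
# Bootstrapped target for one transition (s, a, r, s′): y = r + discount ⋅ maxₐ′ Q_target(s′, a′).
# `q_target` stands in for the target network; here it is a fixed linear map (an assumption).
W_target = Float32[0.5 0.0; 0.0 1.0; 1.0 1.0];   # 3 actions × 2 state components
q_target(s) = W_target * s;                       # one Q-value per action

td_target(r, s′; discount = 0.95f0) = r + discount * maximum(q_target(s′));

s′ = Float32[1.0, 2.0];
y = td_target(1.0f0, s′);    # Q_target(s′) = [0.5, 2.0, 3.0], so y = 1 + 0.95⋅3 = 3.85
println(y)
```

The main network is then trained so that its Q-value for the action actually taken moves toward `y`. Because `q_target` changes only occasionally, the regression target does not shift after every gradient step.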

__Implementation__: For the main and target models, let's build an empty model with default (random) parameter values but a fixed structure. The number and dimension of the layers and the activation functions for each layer are specified when we build the model (but we'll update the parameters during training).
* _Library_: We use [the `Flux.jl` machine learning library](https://github.com/FluxML/Flux.jl) to construct the neural network model. The model will have three layers: the input layer is a `K` $\times$ $2^{K}$ layer with [tanh activation functions](https://fluxml.ai/Flux.jl/stable/reference/models/activation/#NNlib.tanh_fast), the hidden layer is a $2^{K}$ $\times$ $\dim\mathcal{A}$ layer and the output layer is the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).
* _Syntax_: The [`Flux.jl` package](https://github.com/FluxML/Flux.jl) uses some next level syntax. The model is built using [the `Chain` function](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Chain), which takes a list of layers as input. Each layer is defined using the [`Dense` type](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Dense) (in this case), which takes the number of input and output neurons as arguments. The activation function is an additional argument to [the `Dense` type](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Dense). The final layer uses [the `softmax(...)` method exported by the `NNlib.jl` package](https://fluxml.ai/NNlib.jl/dev/reference/#Softmax) to produce a probability distribution over the classes.

We save the main network in the `M::Chain` variable:

In [5]:
M = let 
    
    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel trainable=(input, hidden); # create a "namespaced" of sorts
    MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
        Chain(
            input = Dense(K, number_of_hidden_states, tanh_fast),  # layer 1
            hidden = Dense(number_of_hidden_states, number_of_actions, tanh_fast), # layer 2
            output = NNlib.softmax) # layer 3 (output layer)
    );
    model = MyModel().chain;

    # return -
    model;
end

Chain(
  input = Dense(12 => 4096, tanh_fast),  # 53_248 parameters
  hidden = Dense(4096 => 24, tanh_fast),  # 98_328 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 151_576 parameters, 592.297 KiB.

The target network is a copy of the main network, but its parameters are updated less frequently. The target network is updated by copying the parameters from the main network every few episodes.
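
The periodic copy can be sketched without any neural network library by modeling each network's parameters as plain arrays. The update period `sync_every`, the fake gradient step, and the parameter layout below are assumptions for illustration; with Flux models, a call such as `Flux.loadmodel!` could plausibly play the role of the copy loop.

```julia
# Parameters are modeled as plain arrays here; a Flux model would hold them inside its layers.
main_params   = (W = rand(Float32, 4, 3), b = zeros(Float32, 4));
target_params = deepcopy(main_params);    # the target starts as an exact copy of the main network

sync_every = 100;  # hypothetical update period (in training steps)
for step in 1:250
    main_params.W .+= 0.01f0;             # stand-in for a gradient step on the main network
    if step % sync_every == 0
        for name in keys(main_params)
            # hard update: overwrite the target parameters with the main parameters
            copyto!(getfield(target_params, name), getfield(main_params, name));
        end
    end
end

# the last sync happened at step 200, so the main network is 50 small steps ahead of the target
gap = main_params.W .- target_params.W;
```

Between syncs the target parameters stay frozen, which is the property that makes the Q-value targets stable.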

We save the target network in the `T::Chain` variable:

In [6]:
T = let

    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel trainable=(input, hidden); # create a "namespaced" of sorts
    MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
        Chain(
            input = Dense(K, number_of_hidden_states, tanh_fast),  # layer 1
            hidden = Dense(number_of_hidden_states, number_of_actions, tanh_fast), # layer 2
            output = NNlib.softmax) # layer 3 (output layer)
    );
    model = MyModel().chain;

    # return -
    model;
end

Chain(
  input = Dense(12 => 4096, tanh_fast),  # 53_248 parameters
  hidden = Dense(4096 => 24, tanh_fast),  # 98_328 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 151_576 parameters, 592.297 KiB.

### Replay Buffer
The replay buffer is a data structure that stores the agent's experiences. It is used to sample random batches of experiences for training the main network. We will implement a _vanilla DQN_ agent whose replay buffer is a [simple circular buffer](https://en.wikipedia.org/wiki/Circular_buffer#:~:text=In%20computer%20science%2C%20a%20circular,easily%20to%20buffering%20data%20streams.). 
* _What goes into the buffer_? The replay buffer stores the agent's experiences in the form of tuples $(s,a,r,s^{\prime})$, where $s$ is the current state, $a$ is the action taken, $r$ is the reward received, $s^{\prime}$ is the next state.
* _Implementation_? We'll use [the circular buffer implementation exported by the `DataStructures.jl` package](https://github.com/JuliaCollections/DataStructures.jl). The [`CircularBuffer` type](https://juliacollections.github.io/DataStructures.jl/stable/circ_buffer/) is a fixed-size buffer that overwrites the oldest elements when it becomes full. The size of the buffer is set by the `buffersize::Int64` variable. We can generate $B$ random samples, i.e., our training minibatch, from the replay buffer using [an extended `rand(...)` function](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand). 

The agent builds its own replay buffer, so here let's play with the `CircularBuffer` type to see how it works.

In [7]:
test_replay_buffer = CircularBuffer{Int}(buffersize); # TODO: create a circular buffer that holds Int values
for i ∈ 1:buffersize
    push!(test_replay_buffer, i); # TODO: fill the buffer with the integers 1, 2, ..., buffersize
end
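
To see the overwrite behavior and the minibatch sampling described above, here is a small sketch that stores $(s,a,r,s^{\prime})$ tuples in a buffer of capacity 3. The `Experience` alias, the toy transitions, and the tiny capacity are assumptions chosen so that the overwrite is easy to see; the agent's internal buffer may store its experiences differently.

```julia
using DataStructures

# A small buffer of (s, a, r, s′) tuples; capacity 3 so the overwrite is easy to see.
Experience = Tuple{Vector{Float32}, Int, Float32, Vector{Float32}};
buffer = CircularBuffer{Experience}(3);

for t in 1:5
    s = Float32[t, t];                              # toy state
    push!(buffer, (s, t, Float32(t), s .+ 1f0));    # the oldest entry is dropped once the buffer is full
end

stored_actions = [e[2] for e in buffer];   # transitions 1 and 2 were overwritten
minibatch = rand(buffer, 2);               # sample 2 experiences uniformly, with replacement
println(stored_actions)
```

Sampling uniformly from the buffer breaks up the correlation between consecutive transitions, which is the usual motivation for experience replay.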

### Learning agent model
Finally, we need to build a model for the learning agent. The agent model is an instance of [the `MyDQNLearningAgentModel` type](src/Types.jl). We save our agent model instance in the `agentmodel::MyDQNLearningAgentModel` variable.

In [8]:
agentmodel = let

    model = build(MyDQNLearningAgentModel, (
        mainnetwork = M, # the main network
        targetnetwork = T, # the target network
        number_of_actions = number_of_actions, # number of actions
        number_of_inputs = K, # number of inputs
        Δ = 0.1, # state perturbation
    )); # build the agent

    # return -
    model
end;

What is in the agent model?

## Task 3: Let's watch the DQN agent in action.
In this task, we train the agent and then give it some random vectors and see what it says.

In [None]:
my_trained_agent = learn(agentmodel, world, 
    context = contextmodel, numberofepisodes = 2, maxnumberofsteps = 10000);

In [16]:
let
    a = my_trained_agent.actions;
    sₒ = zeros(Float32, K); # initial state
    sₒ[1] = 1.0; # set the first state to 1.0
    p = my_trained_agent.mainnetwork(sₒ) |> u -> NNlib.softmax(u);
end

24-element Vector{Float32}:
 0.041737862
 0.041611947
 0.041635692
 0.041660223
 0.041694764
 0.041613698
 0.041660555
 0.041677054
 0.041743103
 0.041643206
 ⋮
 0.04164222
 0.04167992
 0.04164572
 0.041666742
 0.041623436
 0.041578513
 0.04173692
 0.041679688
 0.041574482
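
Every entry in the output above is close to $1/24\approx 0.0417$. One plausible contributor, beyond the short training run, is that the network's output layer already applies softmax and the cell then pipes the result through `NNlib.softmax` a second time. The sketch below uses a hand-written `softmax_sketch` helper and made-up logits (both assumptions for illustration) to show how a second softmax flattens a distribution.

```julia
# Hand-written softmax so the sketch needs no packages.
softmax_sketch(x) = (e = exp.(x .- maximum(x)); e ./ sum(e));

logits = collect(range(-3.0, 3.0; length = 24));   # made-up scores for 24 actions
p = softmax_sketch(logits);   # one softmax: clearly non-uniform
q = softmax_sketch(p);        # softmax of probabilities: inputs lie in [0, 1], so q is nearly flat

spread_p = maximum(p) / minimum(p);   # large ratio between most and least likely action
spread_q = maximum(q) / minimum(q);   # can never exceed e ≈ 2.718, because p ∈ [0, 1]
println((spread_p, spread_q))
```

If the near-uniform output is not intended, applying the trained network directly, without the extra softmax, would be one thing to try.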