# L16b: Developing a Deep Q-learning (DQN) Formulation Agent
In this lab, we'll examine a Deep Q-learning (DQN) agent whose objective is to learn to mix $m$ different materials to maximize the utility of the resulting mixture. Each component in the mix has a different coefficient in the utility function and a different unit cost.

* __Scenario__: You are tasked with formulating a product composed of $m$ components, where each component has a unit price $C_{i}$ (which is variable). The product is constrained to cost less than or equal to a specified budget. The state for our problem is the vector of continuous values $n_{i}$, where $n_{i}$ is the abundance of component $i$. Each action increases or decreases the amount of a component $i$.
* __Agent__: We'll use a DQN agent to explore the composition space, where the _Utility_ of a particular composition (random variable) is measured using a _Utility function_, subject to a budget constraint. 


### Tasks
Before you start, execute the `Run All Cells` command to check if you have any code or setup issues - let's get those fixed!

* __Task 1: Setup, Data, Constants (15 min)__: In this task, we set up the computational environment, load the necessary packages, and prepare the `world(...)` function for our formulation problem. We will also define any constants we use throughout the problem set.
* __Task 2: Set up the Context, Main, Target Networks, and the Replay Buffer (20 min)__: In this task, we will build several required models for our deep Q-learning agent. We know the agent will have a _main_ and a _target_ network, and we will also need a replay buffer to store the agent's experiences. In addition, we'll need a context model.
* __Task 3: Let's watch the DQN agent in action (15 min)__: In this task, we train the agent briefly, then give it some random vectors to see what it says. However, before we do that, let's review the training process.

Tests (and other checkpoints) are located throughout the notebook to help us determine if things are running correctly. Let's go! 
___

## Task 1: Setup, Data, Constants
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, the packages are loaded. Otherwise, the packages are downloaded and then loaded.

In [3]:
include("Include.jl");

Next, let's build the `world(...)` function. 
* _What does this function do_? The `world(...)` function takes a state vector `s::Array{Float32,1}` and an action vector `a::Array{Float32,1}`. The elements of `s` describe the composition of the mixture, and the elements of `a` describe the relative increase/decrease of each component. The `world(...)` function also takes a `context` model, which encapsulates the experimental environment, including the budget constraint and the penalty for exceeding it. It returns the next state and the reward (the utility). More on the `context` in `Task 2`.

We've assumed a _linear utility function_ for the formulation problem, where the utility is a linear combination of the component amounts minus a penalty for exceeding the budget. The utility function $U:\mathbb{R}^{N}\rightarrow\mathbb{R}$, where $N = m$ is the number of components, is defined as follows:
$$
\begin{align*}
U_{\lambda}(\mathbf{n},\mathbf{\gamma}) = \sum_{i=1}^{N} \gamma_{i}\cdot n_i - \lambda \cdot \left[\max(0, \sum_{i=1}^{N} c_i \cdot n_{i} - B)\right]^{2}
\end{align*}
$$
where $\gamma_{i}$ is the marginal utility of component $i$ (unknown to the agent, known only to the world), and $n_i$ denotes the amount of component $i$ in the mixture. The quadratic penalty term, scaled by the penalty strength $\lambda\geq{0}$, is subtracted from the utility if the total cost exceeds the budget $B$, where $c_i$ is the unit cost of component $i$.
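
To make the penalty term concrete, here is a minimal sketch of the deterministic utility in plain Julia. The function name `utility` and all numbers are hypothetical and chosen only for illustration; they are not the values used in the lab.

```julia
# Deterministic utility with a quadratic budget penalty (illustrative values only).
function utility(n::Vector{Float64}, γ::Vector{Float64}, c::Vector{Float64}; B::Float64, λ::Float64)
    benefit = sum(γ .* n)                  # Σ γᵢ nᵢ
    overspend = max(0.0, sum(c .* n) - B)  # amount by which the total cost exceeds the budget
    return benefit - λ * overspend^2
end

γ_demo = [1.0, -1.0, 1.0];    # hypothetical marginal utilities
c_demo = [10.0, 20.0, 30.0];  # hypothetical unit costs
n_demo = [2.0, 1.0, 3.0];     # composition: benefit = 4, total cost = 130

U_within = utility(n_demo, γ_demo, c_demo; B = 200.0, λ = 0.1)  # cost ≤ budget: no penalty
U_over = utility(n_demo, γ_demo, c_demo; B = 100.0, λ = 0.1)    # overspend of 30: penalty 0.1 * 30^2
```

Within the budget the utility is just the linear benefit (4 here); once the cost exceeds the budget by 30, the quadratic penalty of 90 dominates and the utility drops to about -86.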

__Hmmm__. Sometimes, we are uncertain about the benefit gained from component $i$, and the unit price of component $i$ is variable. Let's add some randomness to the problem. In the presence of uncertainty, the utility function becomes:
$$
\begin{align*}
U_{\lambda}(\mathbf{n},\mathbf{\gamma}) = \sum_{i=1}^{N} (\gamma_{i} + \sigma_{i} \cdot Z_i) \cdot n_i - \lambda \cdot \left[\max\left(0, \left\{\sum_{i=1}^{N} (c_i + \sigma^{\prime}_{i} \cdot Z^{\prime}_i) \cdot n_{i} - B\right\}\right)\right]^{2}
\end{align*}
$$
where $Z_i, Z^{\prime}_i \sim \mathcal{N}(0,1)$ are independent draws from a standard normal distribution for each component $i$ (the `world(...)` implementation below draws the benefit and cost noise separately), and $\sigma_{i}\geq{0}$ and $\sigma^{\prime}_{i}\geq{0}$ denote the strength of the uncertainty in the marginal utility and in the unit cost of component $i$ (hyperparameters set by us). This adds a stochastic element to the utility function, making it more realistic when the benefits or costs associated with a component are uncertain.
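
As a rough illustration of the stochastic utility, the sketch below evaluates it many times and summarizes the samples. It assumes independent standard normal draws for the benefit noise and the cost noise; the function name `noisy_utility` and all numbers are hypothetical.

```julia
using Random, Statistics

# One noisy evaluation: each marginal utility and each unit cost gets its own N(0,1) draw.
function noisy_utility(rng::AbstractRNG, n, γ, c, σγ, σc; B, λ)
    γ_noisy = γ .+ σγ .* randn(rng, length(n))  # perturbed marginal utilities
    c_noisy = c .+ σc .* randn(rng, length(n))  # perturbed unit costs
    return sum(γ_noisy .* n) - λ * max(0.0, sum(c_noisy .* n) - B)^2
end

rng_u = MersenneTwister(1234);
n_mc = [2.0, 1.0, 3.0]; γ_mc = [1.0, -1.0, 1.0]; c_mc = [10.0, 20.0, 30.0];
σγ_mc = fill(0.1, 3); σc_mc = fill(0.1, 3);

samples = [noisy_utility(rng_u, n_mc, γ_mc, c_mc, σγ_mc, σc_mc; B = 200.0, λ = 0.1) for _ in 1:10_000];
summary_stats = (mean(samples), std(samples))
```

With these numbers the budget is far from binding, so the sample mean should sit close to the noise-free utility of 4 and the sample standard deviation close to $0.1\sqrt{2^2+1^2+3^2}\approx 0.37$.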

In [5]:
function world(s::Array{Float32,1}, a::Array{Float32,1}, context::MyDQNworldContextModel)

    # initialize -
    γ = context.γ; # consumer preferences (unknown to agent)
    σ = context.σ; # noise in utility calculation (unknown to agent)
    B = context.B; # max budget (unknown to agent)
    C = context.C; # unit costs of goods (unknown to agent)
    λ = context.λ; # sensitivity to the budget
    Z = context.Z; # noise model
    number_of_goods = context.m; # number of components in the mixture

    # compute the reward for this choice -
    Ū = 0.0; # initial utility
    BC = 0.0; # initial budget constraint
    for i ∈ 1:number_of_goods
        
        nᵢ = s[i]; # amount of component i in the mixture
        Cᵢ = C[i]; # unit cost of component i
        γᵢ = γ[i]; # marginal utility of component i
   
        # update the utility and the budget constraint -
        Ū += (γᵢ + σ[i,1]*rand(Z))*nᵢ; # linear utility with noise on the marginal utility γᵢ, as in the formula above
        BC += nᵢ*(Cᵢ + σ[i,2]*rand(Z)); # total cost with noise on the unit cost
    end

    # Compute the utility with the budget constraint
    U = (Ū - λ*max(0.0, (BC - B))^2) .|> Float32 ; # use a penalty method to capture budget constraint

    # compute the next state -
    s′ = (s .+ (a.*s)); # update the state with the action taken to get the next state

    # check: keep every component at or above a small positive floor
    s′ = max.(s′, 1e-6); # floor at 1e-6 so that the state stays strictly positive
    
    # implement state constraints -
    s′ = s′ .|> Float32; # convert back to Float32 (the Float64 floor above promotes the array)
    
    # return to caller -
    return s′, U; # return the next state and the reward
end;
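
To make the state update concrete, here is a minimal sketch of the multiplicative update and floor used at the end of `world(...)`. The helper name `next_state` and the numbers are made up for illustration.

```julia
# Multiplicative state update with a small positive floor (mirrors the last steps of world(...)).
next_state(s::Vector{Float32}, a::Vector{Float32}) = max.(s .+ a .* s, 1f-6)

s_demo = Float32[2.0, 0.5, 1f-6];
a_demo = Float32[0.1, 0.0, -1.0];  # +10% on component 1, no change on 2, remove all of component 3

s_next = next_state(s_demo, a_demo)  # component 3 would hit zero, so it is held at the floor
```

Because the change is proportional to the current amount, a component sitting at the floor moves very slowly: a +10% action on `1f-6` only raises it to about `1.1f-6`.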

__Constants__: Set constants we'll use in the subsequent tasks. See the comment beside the value for a description of what it is, its permissible values, etc.

In [7]:
K = 12; # TODO: number of components that we need to mix together
number_of_actions = 2*K; # TODO: number of actions (2 for each component: increase/decrease)
number_of_hidden_states = 2^K; # TODO: number of hidden units (2^K is an arbitrary choice)
number_of_episodes = 2; # TODO: number of training episodes
max_number_of_iterations = 2^14; # TODO: number of steps in each episode (should be ≥ 2^K)
budget = 1000.0; # TODO: budget for the agent, 1000 USD. We can change this later if we want
buffersize = 2^10; # TODO: replay buffer size for the agent
B = 2^6; # TODO: minibatch size for the agent

## Task 2: Set up the Context, Main, Target Networks, and the Replay Buffer
In this task, we will build several required models for our deep Q-learning agent. We know the agent will have a _main_ and a _target_ network, and we will also need a replay buffer to store the agent's experiences. In addition, we'll need a context model.

### Context Model
The context model is an instance of [the `MyDQNworldContextModel` type](src/Types.jl). It holds various parameters that will be used in the `world(...)` function that we developed in Task 1. We save our context model instance in the `contextmodel::MyDQNworldContextModel` variable.

Let's walk through what we are saying here:

In [10]:
contextmodel = let

    # initialize -
    context = nothing; # initialize the context variable to nothing; this variable will be used to store the context model
    γ = Array{Float32,1}(undef, K); # consumer preferences (unknown to agent)
    σ = Array{Float32,2}(undef, K, 2); # noise in utility calculation (unknown to agent). First col is noise for the utility, second col is noise for the price
    C = Array{Float32,1}(undef, K); # unit costs of goods (unknown to agent)
    Z = Normal(0,1); # use a standard normal distribution for the noise model; this can be changed to any distribution as required
    λ = 0.0; # sensitivity to the budget constraint λ ≥ 0. If zero, then no penalty for budget constraint violation.

    # set the parameters -
    for i ∈ 1:K
        # Assign values for γ, σ, and C for each component in the context model
        # For simplicity, the preferences alternate in sign: even-indexed components add utility, odd-indexed ones subtract it
        # This can be customized as per the requirements of the simulation
        γ[i] = iseven(i) ? 1.0 : -1.0; # if i is even, then γ[i] = 1.0, else γ[i] = -1.0
        σ[i,:] .= 0.1; # uniform uncertainty for all goods and prices, this can be adjusted based on the specific needs of the simulation
        C[i] = 10.0 + 10.0 * (i - 1); # linearly increasing costs for goods, this can be customized as per the requirement
    end

    # TODO: Uncomment the code below to build the context model -
    # build a context model with the required parameters -
    context = build(MyDQNworldContextModel, (
        γ = γ, # consumer preferences (unknown to agent)
        σ = σ, # noise in utility calculation (unknown to agent)
        B = budget, # max budget (unknown to agent); note that the global `B` is the minibatch size, not the budget
        C = C, # unit costs of goods (unknown to agent)
        λ = λ, # sensitivity to the budget
        Z = Z, # noise model
        m = K, # number of components
    )); # build the context

    # return the model -
    context;
end;

In [11]:
# Checkpoint: TODO: What's in the context model?

In [53]:
contextmodel.C

12-element Vector{Float32}:
  10.0
  20.0
  30.0
  40.0
  50.0
  60.0
  70.0
  80.0
  90.0
 100.0
 110.0
 120.0

### Main and Target Networks
The main network is used to select actions, while the target network is used to evaluate the actions taken by the main network. These networks have the same architecture but are updated at different rates. The main network is updated more frequently, while the target network is updated less frequently to provide a stable target for the Q-value updates.

__Implementation__: For the main and target models, let's build an empty model with default (random) parameter values but a fixed structure. The number and dimension of the layers and the activation functions for each layer are specified when we build the model (but we'll update the parameters during training).
* _Library_: We use [the `Flux.jl` machine learning library](https://github.com/FluxML/Flux.jl) to construct the neural network model. The model has three layers: the input layer is a `K` $\times$ $2^{K}$ layer with [tanh activation functions](https://fluxml.ai/Flux.jl/stable/reference/models/activation/#NNlib.tanh_fast), the hidden layer is a $2^{K}$ $\times$ $\dim\mathcal{A}$ layer (also with tanh activation), and the output layer is the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).
* _Syntax_: The [`Flux.jl` package](https://github.com/FluxML/Flux.jl) has a compact syntax. The model is built using [the `Chain` function](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Chain), which takes a sequence of layers as input. Each layer is defined using the [`Dense` type](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Dense) (in this case), which takes the number of input and output neurons as arguments. The activation function is an additional argument to [the `Dense` type](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Dense). The final layer uses [the `softmax(...)` method exported by the `NNlib.jl` package](https://fluxml.ai/NNlib.jl/dev/reference/#Softmax) to produce a probability distribution over the actions.

We save the main network in the `M::Chain` variable:

In [13]:
M = let 
    
    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel trainable=(input, hidden); # create a "namespaced" of sorts
    MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
        Chain(
            input = Dense(K, number_of_hidden_states, tanh_fast),  # layer 1
            hidden = Dense(number_of_hidden_states, number_of_actions, tanh_fast), # layer 2
            output = NNlib.softmax) # layer 3 (output layer)
    );
    model = MyModel().chain;

    # return -
    model;
end

Chain(
  input = Dense(12 => 4096, tanh_fast),  # 53_248 parameters
  hidden = Dense(4096 => 24, tanh_fast),  # 98_328 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 151_576 parameters, 592.297 KiB.
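
The parameter counts in the printed summary can be checked by hand: a dense layer has one weight per (input, output) pair plus one bias per output. The sketch below redoes that arithmetic and runs a hand-written forward pass with the same shape (tanh, tanh, softmax) in plain Julia. It does not use Flux, and the helper names (`dense_params`, `softmax_vec`, `forward`) and the random weights are assumptions for illustration only.

```julia
# Parameter count of a dense layer: weights (nin * nout) plus biases (nout).
dense_params(nin::Int, nout::Int) = nin * nout + nout

K_demo = 12; nh = 2^K_demo; nA = 2 * K_demo;
counts = (dense_params(K_demo, nh), dense_params(nh, nA),
          dense_params(K_demo, nh) + dense_params(nh, nA))

# Numerically stable softmax for a vector.
function softmax_vec(z)
    e = exp.(z .- maximum(z))  # subtract the max before exponentiating
    return e ./ sum(e)
end

# Forward pass with the same layer shapes as the printed Chain: tanh -> tanh -> softmax.
forward(W1, b1, W2, b2, x) = softmax_vec(tanh.(W2 * tanh.(W1 * x .+ b1) .+ b2))

W1 = 0.01f0 .* randn(Float32, nh, K_demo); b1 = zeros(Float32, nh);
W2 = 0.01f0 .* randn(Float32, nA, nh);     b2 = zeros(Float32, nA);
p = forward(W1, b1, W2, b2, rand(Float32, K_demo));  # one value per action, summing to 1
```

Here `counts` reproduces the 53_248, 98_328, and 151_576 figures from the summary, and `p` has one entry for each of the 24 actions.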

The target network has the same architecture as the main network, but its parameters are updated less frequently. The target network is updated by periodically copying the parameters from the main network (every $C$ steps in the training algorithm reviewed in Task 3).

We save the target network in the `T::Chain` variable:

In [15]:
T = let

    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel trainable=(input, hidden); # create a "namespaced" of sorts
    MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
        Chain(
            input = Dense(K, number_of_hidden_states, tanh_fast),  # layer 1
            hidden = Dense(number_of_hidden_states, number_of_actions, tanh_fast), # layer 2
            output = NNlib.softmax) # layer 3 (output layer)
    );
    model = MyModel().chain;

    # return -
    model;
end

Chain(
  input = Dense(12 => 4096, tanh_fast),  # 53_248 parameters
  hidden = Dense(4096 => 24, tanh_fast),  # 98_328 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 151_576 parameters, 592.297 KiB.
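
A hard target update, $\theta^{-}\leftarrow\theta$, amounts to copying parameter arrays from the main network into the target network. The sketch below uses plain arrays in a `Dict` rather than Flux layers; the update period `C_demo = 4` and the fake "gradient step" are assumptions for illustration, and how `learn(...)` performs the copy is not shown in this notebook.

```julia
# Hard target update sketch with plain arrays standing in for network parameters.
θ = Dict(:W => randn(Float32, 3, 2), :b => zeros(Float32, 3));  # main parameters
θ_target = deepcopy(θ);                                         # target starts as an independent copy

C_demo = 4;  # hypothetical update period (in steps)
for step in 1:10
    θ[:W] .+= 0.01f0                  # stand-in for a gradient step on the main parameters
    if step % C_demo == 0
        for k in keys(θ)
            copyto!(θ_target[k], θ[k])  # θ⁻ ← θ, copied in place
        end
    end
end

gap = θ[:W] .- θ_target[:W]  # the target was last synced at step 8, so it lags by two steps
```

Between syncs the target stays frozen while the main parameters keep moving, which is what gives the Q-value targets their stability.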

### Replay Buffer
The replay buffer is a data structure that stores the agent's experiences. It is used to sample random batches of experiences for training the main network. We will implement a _vanilla DQN_ agent whose replay buffer is a [simple circular buffer](https://en.wikipedia.org/wiki/Circular_buffer). 
* _What goes into the buffer_? The replay buffer stores the agent's experiences in the form of tuples $(s,a,r,s^{\prime})$, where $s$ is the current state, $a$ is the action taken, $r$ is the reward received, and $s^{\prime}$ is the next state.
* _Implementation_? We'll use [the circular buffer implementation exported by the `DataStructures.jl` package](https://github.com/JuliaCollections/DataStructures.jl). The [`CircularBuffer` type](https://juliacollections.github.io/DataStructures.jl/stable/circ_buffer/) is a fixed-size buffer that overwrites the oldest elements when it becomes full. The size of the buffer is set by the `buffersize::Int64` constant. We can generate $B$ random samples from the replay buffer, i.e., our training minibatch, using [an extended `rand(...)` function](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand). 

The agent builds its own replay buffer, so let's try a sample `CircularBuffer` to see how it works.

In [55]:
test_replay_buffer = CircularBuffer{Int}(10); # TODO: create a circular buffer that holds 10 Int values
for i ∈ 1:buffersize
    push!(test_replay_buffer, i); # TODO: push 1, 2, …, buffersize; once the buffer is full, the oldest values are overwritten
end

In [18]:
# Checkpoint: TODO: What happens if we push more than the buffer size?

In [57]:
test_replay_buffer

10-element CircularBuffer{Int64}:
 1015
 1016
 1017
 1018
 1019
 1020
 1021
 1022
 1023
 1024

In [71]:
push!(test_replay_buffer, rand(1:50))

10-element CircularBuffer{Int64}:
 1022
 1023
 1024
    6
   30
    1
   43
   29
    2
   27

In [79]:
rand(test_replay_buffer, 5)

5-element Vector{Int64}:
    1
 1023
   29
   43
   27
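
The same overwrite-and-sample pattern carries over to experience tuples $(s,a,r,s^{\prime})$. The sketch below is a hand-rolled ring buffer in plain Julia so that it runs without `DataStructures.jl`; the `RingBuffer` type, the `Experience` alias, and the helpers `make_experience` and `sample_batch` are hypothetical and are not the agent's actual implementation.

```julia
using Random

# An experience is a tuple (s, a, r, s′).
const Experience = Tuple{Vector{Float32}, Vector{Float32}, Float32, Vector{Float32}}

# Minimal fixed-capacity ring buffer that overwrites its oldest entry when full.
mutable struct RingBuffer
    data::Vector{Experience}
    capacity::Int
    next::Int  # slot that the next push! will write to
end
RingBuffer(capacity::Int) = RingBuffer(Experience[], capacity, 1)

function Base.push!(buf::RingBuffer, e::Experience)
    if length(buf.data) < buf.capacity
        push!(buf.data, e)       # still filling up
    else
        buf.data[buf.next] = e   # overwrite the oldest entry
    end
    buf.next = mod1(buf.next + 1, buf.capacity)
    return buf
end

# Uniform sampling with replacement, in the spirit of rand(buffer, B) in the cells above.
sample_batch(rng::AbstractRNG, buf::RingBuffer, B::Int) = rand(rng, buf.data, B)

# Fake experience whose reward is just the time index t, so overwrites are easy to see.
function make_experience(t::Int)
    s = fill(Float32(t), 2); a = Float32[0.1, 0.0]
    return (s, a, Float32(t), s .+ a .* s)
end

rng_buf = MersenneTwister(0);
buf = RingBuffer(4);
foreach(t -> push!(buf, make_experience(t)), 1:6);  # experiences 1 and 2 get overwritten
batch = sample_batch(rng_buf, buf, 3);
stored_rewards = sort([e[3] for e in buf.data])
```

After six pushes into a buffer of capacity four, only the experiences with rewards 3 through 6 remain, so every sampled mini-batch is drawn from those.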

### Learning agent model
Finally, we need to build a model for the learning agent. The agent model is an instance of [the `MyDQNLearningAgentModel` type](src/Types.jl). We save our agent model instance in the `agentmodel::MyDQNLearningAgentModel` variable.

In [20]:
agentmodel = let

    model = build(MyDQNLearningAgentModel, (
        mainnetwork = M, # the main network
        targetnetwork = T, # the target network
        number_of_actions = number_of_actions, # number of actions
        number_of_inputs = K, # number of inputs
        Δ = 0.1, # state perturbation 
    )); # build the agent

    # return -
    model
end;

What is in the agent model?

In [22]:
## Checkpoint: TODO: make sure we understand what data our agent has

## Task 3: Let's watch the DQN agent in action.
In this task, we train the agent briefly, then give it some random vectors to see what it says. However, before we do that, let's review the training process.

#### DQN Training Algorithm
__Initialize__ the parameters of the main Q-network $Q_{\theta}(s,a)$ and the target Q-network $Q^{\prime}_{\theta^{-}}(s,a)$ to random values. Initialize a (potentially infinite) replay buffer $\mathcal{B}$. Set the hyperparameters: the learning rate $\alpha$, the discount factor $\gamma$ (not to be confused with the marginal utilities $\gamma_{i}$ from Task 1), the exploration rate $\epsilon_{t}$, the mini-batch size $B$ (which is also the minimum number of experiences the replay buffer must hold before sampling begins), and the target update period $C$.
- For each episode, initialize the state to $s_0$ and:
   - For each time step $t=1,\ldots,T$:
        1. Roll a random number $p\in[0,1]$. If $p\leq\epsilon_{t}$, choose a random (uniform) action $a_{t}\in\mathcal{A}$. Otherwise, choose a greedy action $a_{t} = \text{arg}\max_{a\in\mathcal{A}}{Q_{\theta}(s_{t},a)}$.
        2. Execute action $a_{t}$, observe the reward $r_{t}$ from the _world_ and transition to the next state $s_{t+1}$. 
        3. Store the transition (experience) $e=(s_t, a_t, r_t, s_{t+1})$ in the replay buffer: $e\rightarrow\mathcal{B}$. 
        4. If the replay buffer $\mathcal{B}$ holds at least the _minimum number of elements_, randomly sample a mini-batch of $B$ transitions from the replay buffer: $(s_j, a_j, r_j, s^{\prime}_{j}),\, j = 1, 2, \dots, B$. Each tuple represents a state-action-reward-next state experience collected during environment interaction.
        5. Compute the _target Q-value_ for each transition in the mini-batch using the _target Q-network_: $y_j = r_j + \gamma \cdot \max_{a^{\prime}\in\mathcal{A}}Q^{\prime}_{\theta^{-}}(s^{\prime}_{j},a^{\prime})$ for $j=1,2,\ldots,B$.
        6. Compute the _mean squared loss_ function over the $B$ experiences in the mini-batch: $L(\theta) = \frac{1}{B}\sum_{j=1}^{B}\left(y_j - Q_{\theta}(s_j,a_j)\right)^2$.
        7. Perform a _single_ gradient descent step to minimize the loss function $L(\theta)$ with respect to the parameters $\theta$ of the main Q-network: $\theta \leftarrow \theta - \alpha \nabla_{\theta}L(\theta)$, where $\alpha$ is the learning rate. 
            - _Why only a single step_? Each mini-batch is just a _small sample of the environment’s dynamics._ The goal of DQN is _online learning_: the network parameters are continuously updated as new experiences come in. If we force training to converge on each mini-batch, it risks _overfitting to that mini-batch_.
        8. Update the state $s_t \leftarrow s_{t+1}$.
        9. Every $C$ steps, update the target Q-network parameters: $\theta^{-} \leftarrow \theta$.
   - End For
- End For
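
The core of the loop above (ε-greedy selection, target computation, the mean squared loss, and a single gradient step) can be sketched in plain Julia. This is a toy illustration under stated assumptions, not the `learn(...)` implementation: a linear map `LinearQ` stands in for the neural network so that the gradient can be written by hand, and all names and numbers are hypothetical.

```julia
using Random

# Toy stand-in for the Q-network: a linear map from a state to one Q-value per action.
struct LinearQ
    W::Matrix{Float64}
    b::Vector{Float64}
end
(q::LinearQ)(s) = q.W * s .+ q.b

# ε-greedy selection: explore with probability ϵ, otherwise pick the greedy action.
select_action(rng, q::LinearQ, s, ϵ) = rand(rng) ≤ ϵ ? rand(rng, 1:length(q.b)) : argmax(q(s))

# Target value from the target network; γ here is the discount factor.
target(qtarget, r, s′, γ) = r + γ * maximum(qtarget(s′))

# Mean squared loss of the main network over a mini-batch of (s, a, r, s′) tuples.
function loss(q, qtarget, batch, γ)
    return sum((target(qtarget, r, s′, γ) - q(s)[a])^2 for (s, a, r, s′) in batch) / length(batch)
end

# A single gradient descent step; the gradient is analytic for the linear model.
function sgd_step!(q, qtarget, batch, γ, α)
    ∇W = zero(q.W); ∇b = zero(q.b)
    for (s, a, r, s′) in batch
        δ = target(qtarget, r, s′, γ) - q(s)[a]  # TD error, with the target held fixed
        ∇W[a, :] .-= 2δ .* s ./ length(batch)
        ∇b[a] -= 2δ / length(batch)
    end
    q.W .-= α .* ∇W; q.b .-= α .* ∇b
    return q
end

rng_q = MersenneTwister(42);
nS_demo, nA_demo = 3, 4;
qmain = LinearQ(0.1 .* randn(rng_q, nA_demo, nS_demo), zeros(nA_demo));
qtarget = LinearQ(copy(qmain.W), copy(qmain.b));  # target starts as a copy of the main network
batch_q = [(rand(rng_q, nS_demo), rand(rng_q, 1:nA_demo), randn(rng_q), rand(rng_q, nS_demo)) for _ in 1:8];

loss_before = loss(qmain, qtarget, batch_q, 0.95);
sgd_step!(qmain, qtarget, batch_q, 0.95, 0.01);
loss_after = loss(qmain, qtarget, batch_q, 0.95);
```

With the target held fixed, the loss is quadratic in the parameters of this linear model, so one small step should lower the loss on that mini-batch without fitting it exactly, which is the behavior the "single step" note describes.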

We've implemented this algorithm in [the `learn(...)` method](src/Compute.jl). Let's run the training process and test the agent on random vectors. 

In [24]:
my_trained_agent = learn(agentmodel, world, 
    context = contextmodel, numberofepisodes = number_of_episodes, maxnumberofsteps = max_number_of_iterations, 
    minibatchsize = B, maxreplaybuffersize=buffersize);

What's in the agent's replay buffer? 

In [26]:
s, a, r, s′ = let
    
    # initialize -
    agent_replay_buffer = my_trained_agent.replaybuffer;

    # draw a random experience from the replay buffer -
    experience = rand(agent_replay_buffer); # draw a random experience
    s, a, r, s′ = experience; # unpack the experience
end;

__Check__: Do we recover the next state $s^{\prime}$ from the $s$ and $a$ values in the replay buffer?

In [28]:
let
    true_s′ = s′;
    computed_s′ = s .+ (a.*s); # compute the next state
    @assert true_s′ ≈ computed_s′; # check if the computed next state is equal to the true next state
end

In [29]:
r

14.854555f0

In [30]:
# TODO: What does a s -a-> s′ look like?
[s s′ a]

12×3 Matrix{Float32}:
  6.69435    6.69435   -0.0
  1.03138    1.03138   -0.0
  1.05264    1.05264   -0.0
  0.315708   0.315708  -0.0
  0.896279   0.896279  -0.0
  1.08324    1.08324   -0.0
  0.933885   0.933885  -0.0
  1.11662    1.11662   -0.0
  0.837063   0.837063  -0.0
 22.3394    22.3394    -0.0
  0.83514    0.83514   -0.0
  1.0f-6     1.0f-6    -0.01

In [31]:
let
    a = my_trained_agent.actions;
    sₒ = rand(Float32, K); # initial state
    i = my_trained_agent.mainnetwork(sₒ) |> u-> argmax(u)
end

7

# The End: Thank You!

Thank you all for your hard work and engagement throughout this Machine Learning and Artificial Intelligence course. 
* _What did we do_? We’ve explored many exciting topics, from __clustering__ and __binary/multiclass classification__ to __kernels__, early models such as __Hopfield networks__, and __Boltzmann machines__. We’ve also navigated the landscape of __neural networks__ in their many forms and explored online learning approaches like __multi-armed bandits__ and __reinforcement learning__.

Your curiosity, persistence, and thoughtful questions made this an enriching experience. I hope you carry forward not just the technical skills but also a deeper appreciation for the power and possibilities of machine learning.

Wishing you all the best in your future studies and projects—**keep exploring and pushing boundaries!**