# L16b: Let's build a Deep Q-learning (DQN) Agent
In this lab, we'll look at a Deep Q-learning (DQN) agent whose objective is to learn to mix $K$ different materials to maximize the benefit of the mixture. 

### Tasks
Before you start, execute the `Run All Cells` command to check if you have any code or setup issues - let's get those fixed!

* __Task 1: Setup, Data, Constants__: In this task, we set up the computational environment, load the necessary packages, and prepare the `world(...)` function for our personal shopper problem. We will also define any constants we use throughout the problem set.
* __Task 2: Set up the Context, Main and Target Networks, and the Replay Buffer__: In this task, we will build the models required for our deep Q-learning agent: the context model, which holds the parameters used by the `world(...)` function that we developed in Task 1, the main and target networks, and the replay buffer.
* __Task 3: Let's watch the DQN agent in action__: In this task, we'll run the DQN agent and see how well it performs. We display the results and ask you a few discussion questions.

Tests throughout the notebook (and at the bottom section) help you determine if things are running correctly. Let's go! (Remember to answer the discussion questions.)
___

## Task 1: Setup, Data, Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Otherwise, packages are downloaded and then loaded.

In [1]:
include("Include.jl");

Next, let's build the `world(...)` function. 
* The `world(...)` function takes the state vector `s::Array{Float64}`, whose element $i$ is the amount $n_i$ of component $i$ currently in the mixture, and the action vector `a::Array{Float64,1}`, which holds the fractional change applied to each component, so that the next state is `s′ = s .+ (a.*s)`. Both vectors have length $N$, the number of components in the mixture. Finally, the `world(...)` function takes a `context` model, which encapsulates the environment, including the budget constraint and the penalty for exceeding it. The function returns the next state and the reward (the utility). More on the `context` in `Task 2`.

We've assumed a _linear utility function_ for the personal shopper problem, where the utility is a linear combination of the items chosen minus a penalty for exceeding the budget. The utility function $U:\mathbb{R}^{N}\rightarrow\mathbb{R}$ is defined as follows:
$$
\begin{align*}
U_{\lambda}(\mathbf{n},\mathbf{\gamma}) = \sum_{i=1}^{N} \gamma_{i}\cdot n_i - \lambda \cdot \left[\max(0, \sum_{i=1}^{N} c_i \cdot n_{i} - B)\right]^{2}
\end{align*}
$$
where $\gamma_{i}$ is the marginal utility of option $i$ (unknown to the agent, only known to the world), and $n_i$ denotes the amount of component $i$ in the mixture. The quadratic penalty term, weighted by $\lambda$, is subtracted from the utility if the total cost exceeds the budget $B$, where $c_i$ is the unit cost of item $i$.
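
As an illustration with made-up numbers (not taken from the lab's data), take $N=2$, $\gamma=(1,2)$, $c=(10,20)$, $B=100$, and $\lambda=0.1$. The first mixture stays within the budget and pays no penalty; the second overspends by 20 and is penalized heavily:
$$
\begin{align*}
\mathbf{n}=(2,3):\quad & U_{\lambda} = (1\cdot 2 + 2\cdot 3) - 0.1\cdot\left[\max(0,\, 10\cdot 2 + 20\cdot 3 - 100)\right]^{2} = 8 - 0.1\cdot 0^{2} = 8\\
\mathbf{n}=(4,4):\quad & U_{\lambda} = (1\cdot 4 + 2\cdot 4) - 0.1\cdot\left[\max(0,\, 10\cdot 4 + 20\cdot 4 - 100)\right]^{2} = 12 - 0.1\cdot 20^{2} = -28
\end{align*}
$$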

__Hmmm__. Sometimes, we are uncertain about the benefit gained when we purchase good $i$, so let's add some randomness to the problem. In the presence of uncertainty, the utility function becomes:
$$
\begin{align*}
U_{\lambda}(\mathbf{n},\mathbf{\gamma}) = \sum_{i=1}^{N} (\gamma_{i} + \sigma_{i} \cdot Z_i) \cdot n_i - \lambda \cdot \left[\max(0, \sum_{i=1}^{N} c_i \cdot n_{i} - B)\right]^{2}
\end{align*}
$$
where $Z_i \sim \mathcal{N}(0,1)$ is a random variable drawn from a standard normal distribution for each item $i$, and $\sigma_{i}\geq{0}$ denotes the strength of the uncertainty associated with good $i$ (hyperparameter set by the shopper). This adds a stochastic element to the utility function, making it more realistic in scenarios where the benefits of purchasing items are uncertain.
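
To see how the noise term behaves, here is a minimal sketch of the stochastic utility, separate from the lab's `world(...)` function. The helper name `noisy_utility`, the use of `randn` for $Z_i$, and all numerical values are assumptions made for illustration only.

```julia
using Random, Statistics

# Hypothetical helper (not part of the lab's codebase): one noisy draw of the utility.
function noisy_utility(n::Vector{Float64}, γ::Vector{Float64}, σ::Vector{Float64},
    c::Vector{Float64}, B::Float64, λ::Float64; rng::AbstractRNG = Random.default_rng())::Float64

    Z = randn(rng, length(n));              # Zᵢ ~ N(0,1), one draw per good
    benefit = sum((γ .+ σ .* Z) .* n);      # Σ (γᵢ + σᵢ⋅Zᵢ)⋅nᵢ
    overspend = max(0.0, sum(c .* n) - B);  # amount over budget (zero if within budget)
    return benefit - λ * overspend^2;
end

# Made-up values: sample the utility of one mixture many times
noisy_demo = let
    rng = MersenneTwister(1234);
    n = [2.0, 3.0]; γ = [1.0, 2.0]; σ = [0.5, 0.5]; c = [10.0, 20.0];
    samples = [noisy_utility(n, γ, σ, c, 100.0, 0.1; rng = rng) for _ ∈ 1:10_000];
    (average = mean(samples), spread = std(samples))
end
```

For this within-budget mixture, the expected utility is $8$ and the theoretical standard deviation is $\sqrt{(0.5\cdot 2)^2 + (0.5\cdot 3)^2}\approx 1.8$, so the sample average and spread should land near those values. Setting every $\sigma_i = 0$ recovers the deterministic utility.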

In [None]:
function world(s::Array{Float64}, a::Array{Float64,1}, context::MyDQNworldContextModel)::Tuple{Array{Float64}, Float64}

    # initialize -
    γ = context.γ; # consumer preferences (unknown to agent)
    σ = context.σ; # noise in utility calculation (unknown to agent)
    B = context.B; # max budget (unknown to agent)
    C = context.C; # unit costs of goods (unknown to agent)
    λ = context.λ; # sensitivity to the budget
    Z = context.Z; # noise model
    number_of_goods = context.m; # number of goods (components) in the mixture

    # compute the reward for this choice -
    Ū = 0.0; # initial utility
    BC = 0.0; # initial budget constraint
    for i ∈ 1:number_of_goods
        
        nᵢ = s[i]; # amount of good i in the current state
        Cᵢ = C[i]; # unit cost of good i
        γᵢ = γ[i]; # marginal utility (preference) for good i
        σᵢ = σ[i]; # standard dev for good i
   
        # update the utility and the budget constraint -
        Ū += (γᵢ + σᵢ*rand(Z))*nᵢ; # noisy linear utility for this good: (γᵢ + σᵢ⋅Zᵢ)⋅nᵢ
        BC += nᵢ*Cᵢ; # accumulate the total cost of the mixture
    end

    # compute the utility with the budget constraint
    U = Ū - λ*max(0.0, (BC - B))^2; # use a penalty method to capture budget constraint

    # compute the next state -
    s′ = s .+ (a.*s); # update the state with the action taken to get next state
    
    # return to caller -
    return s′, U; # return the next state and the reward
end;
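
The state update inside `world(...)` is multiplicative: each entry of the action scales the matching entry of the state. The sketch below isolates that one line so its behavior is easy to inspect. The helper `next_state` and the numbers are hypothetical and not part of the lab's codebase.

```julia
# Hypothetical helper mirroring the update used inside `world(...)`
next_state(s::Vector{Float64}, a::Vector{Float64}) = s .+ (a .* s);

state_demo = let
    s = [10.0, 5.0, 0.0];    # made-up amounts; the third component is absent
    a = [0.1, -0.5, 0.1];    # +10%, -50%, +10%
    next_state(s, a)         # third component stays at zero
end
```

Here the next state is approximately `[11.0, 2.5, 0.0]`. Note that a component whose amount is zero remains zero under this update, whatever the action, so the initial state matters.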


__Constants__: Set constants we'll use in the subsequent tasks. See the comment beside the value for a description of what it is, its permissible values, etc.

In [None]:
K = 12; # TODO: Let's consider 12 different items that we need to mix together
number_of_actions = 2*K; # TODO: number of actions (2 for each item: increase or decrease)
T = 2^14; # TODO: number of rounds for each decision task (should be ≥ 2^K)
B = 100.0; # TODO: Budget for shopper, assume 100 USD. We can change this later if we want
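
With `2*K` discrete actions, each action index has to be turned into the action vector `a` that `world(...)` expects. One possible encoding is sketched below. The ordering (indices `1..K` increase an item, `K+1..2K` decrease it), the step size `δ`, and the name `decode_action` are all assumptions for illustration; the lab may use a different convention.

```julia
# Hypothetical decoding: indices 1..K increase item k, indices K+1..2K decrease item k-K.
function decode_action(index::Int, K::Int; δ::Float64 = 0.1)::Vector{Float64}
    1 ≤ index ≤ 2*K || throw(ArgumentError("action index must be in 1:$(2*K)"));
    a = zeros(Float64, K);
    if index ≤ K
        a[index] = δ;        # increase item `index` by a fraction δ
    else
        a[index - K] = -δ;   # decrease item `index - K` by a fraction δ
    end
    return a;
end

action_demo = let
    K = 12;
    (up = decode_action(3, K), down = decode_action(15, K))  # both act on item 3
end
```

Each decoded vector has exactly one nonzero entry, so a single step changes the amount of one component and leaves the rest untouched.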

## Task 2: Set up the Context, Main, Target Networks, and the Replay Buffer
In this task, we will build several models that are required for our deep Q-learning agent.
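
As a rough sketch of what a replay buffer does, the plain-Julia example below stores transitions in a fixed-capacity buffer, overwrites the oldest entry once it is full, and samples a random minibatch. The types `Transition` and `ReplayBuffer` and the functions `store!` and `sample_batch` are hypothetical names for illustration; the lab's own types may look different.

```julia
# Hypothetical types for illustration only
struct Transition
    s::Vector{Float64}   # state
    a::Int               # action index
    r::Float64           # reward
    s′::Vector{Float64}  # next state
end

mutable struct ReplayBuffer
    capacity::Int
    memory::Vector{Transition}
    position::Int        # next slot to overwrite once the buffer is full
end
ReplayBuffer(capacity::Int) = ReplayBuffer(capacity, Transition[], 1);

function store!(buffer::ReplayBuffer, t::Transition)
    if length(buffer.memory) < buffer.capacity
        push!(buffer.memory, t);               # still filling up
    else
        buffer.memory[buffer.position] = t;    # overwrite the oldest entry
    end
    buffer.position = mod1(buffer.position + 1, buffer.capacity);
    return buffer;
end

# Sample a minibatch uniformly at random (with replacement)
sample_batch(buffer::ReplayBuffer, batchsize::Int) =
    rand(buffer.memory, min(batchsize, length(buffer.memory)));

buffer_demo = let
    buffer = ReplayBuffer(3);
    for k ∈ 1:5
        store!(buffer, Transition([1.0], 1, Float64(k), [1.0]));
    end
    buffer
end
```

After storing five transitions in a buffer of capacity three, only the three most recent remain. Sampling from the buffer, rather than training on consecutive transitions, is what breaks the correlation between successive updates.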

In [None]:
M = let 
    # fill me in
end

Fill me in

In [None]:
T = let
end

## Task 3: Let's watch the DQN agent in action.