# PS4: A Contextual Stochastic Bandit Personal Shopper
Fill me in

## Background: Consumer choice problems
Imagine that a consumer must choose quantities of $m$ goods (where each good comes from a category with $k$ alternatives), $\mathbf{n} = \left\{n_{1},n_{2},\dots,n_{m}\right\}$, where $n_{j}\in\mathbb{R}_{\geq{0}}$ is the quantity of good $j$ chosen, i.e., the consumer must choose a non-negative quantity of each good. Different combinations of the $m$ goods are _scored_, i.e., assigned a value for how much benefit or happiness they elicit, using a utility function $U:\mathbb{R}^{m}\rightarrow\mathbb{R}$. In this case, let's assume our consumer uses a linear utility model:
$$
\begin{align*}
U(\mathbf{n}) = \sum_{i=1}^{m}n_{i}\cdot{\gamma_{i}}
\end{align*}
$$
where $\gamma_{i}$ denotes the _user sentiment_ parameter for good $i$: if $\gamma_{i}>{0}$, then good $i$ is _preferred_ (an increase in $n_{i}$, all else held the same, increases the utility); otherwise, if $\gamma_{i}<{0}$, then good $i$ is _not preferred_ (an increase in $n_{i}$, all else held the same, gives a lower utility). Finally, the choice of goods $n_{1},\dots,n_{m}$ is subject to a budget constraint (and potentially other constraints) that limits the total amount of money the consumer can spend on goods:
$$
\begin{align*}
\sum_{i=1}^{m}n_{i}\cdot{C}_{i} \leq B
\end{align*}
$$
where $C_{i}$ is the unit cost of good $i$, and $B$ is the total budget the consumer can spend. The objective of a consumer is to maximize the utility of their choice (the combination of $m$ goods) subject to a budget constraint. We could solve this problem using [Linear Programming](https://en.wikipedia.org/wiki/Linear_programming). However, let's try to do this as a contextual stochastic bandit problem where the agent (personal shopper) learns to recommend a combination of goods over time based on user feedback.
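
As a concrete illustration of the utility and budget expressions above, the following minimal sketch scores a single bundle. The values of `γ`, `C`, `B`, and `n`, and the helper names `utility` and `spend`, are made up for this example and are not part of the problem set code.

```julia
using LinearAlgebra: dot

# Hypothetical example data for m = 3 goods (not taken from the problem set)
γ = [1.0, 2.0, -0.5];   # sentiment parameters: goods 1 and 2 preferred, good 3 not preferred
C = [10.0, 20.0, 5.0];  # unit costs
B = 100.0;              # total budget
n = [2.0, 3.0, 1.0];    # candidate bundle (non-negative quantities)

utility(n, γ) = dot(n, γ);  # U(n) = Σᵢ nᵢ⋅γᵢ
spend(n, C) = dot(n, C);    # Σᵢ nᵢ⋅Cᵢ

U = utility(n, γ);               # 2.0*1.0 + 3.0*2.0 + 1.0*(-0.5) = 7.5
is_feasible = spend(n, C) ≤ B;   # 20 + 60 + 5 = 85 ≤ 100
println("U = $(U), spend = $(spend(n, C)), feasible = $(is_feasible)")
```

Raising $n_{3}$ in this example lowers $U$ because $\gamma_{3}<0$, while raising $n_{2}$ increases $U$ fastest but also consumes the budget fastest.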

## Background: Stochastic Multi-Armed Bandits
In the stochastic multi-armed bandit problem, an agent must choose an action $a$ from the set of all possible actions $\mathcal{A}$, where $\lvert\mathcal{A}\rvert = K$, during each round $t = 1,2,\dots, T$ of a decision task. The agent chooses action $a\in\mathcal{A}$ and receives a reward $r_{a}$ from the environment, where $r_{a}$ is sampled from some (unknown) distribution $\mathcal{D}_{a}$.

For $t = 1,2,\dots,T$:
1. _Aggregator_: The agent picks an action $a_{t} \in \mathcal{A}$ at time $t$. How the agent makes this choice is one of the main differences between the different algorithms for solving this problem. 
2. _Adversary_: The agent implements action $a_{t}$ and receives a (random) reward $r_{t}\sim\mathcal{D}_{a_{t}}$, where $r_{t}\in\left[0,1\right]$. The distribution $\mathcal{D}_{a_{t}}$ is known only to the adversary.
3. The agent updates its _memory_ with the reward and continues to the next decision task. 

The agent is interested in learning the mean of the reward distribution of each arm, $\mu(a) = \mathbb{E}_{r\sim\mathcal{D}_{a}}\left[r\right]$, by experimenting against the world (adversary). 
* __Goal__: The goal of the agent is to maximize the total reward. However, the goal of the algorithm designer is to minimize the _regret_ of the algorithm that the agent uses to choose $a\in\mathcal{A}$.
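
The _regret_ in the goal above can be made concrete. A standard definition (following the convention in Slivkins; the symbol $\mu^{\star}$ is introduced here and is not used elsewhere in this notebook) compares the expected reward of always playing the best arm with the expected reward of the arms the agent actually chose:

$$
\begin{align*}
R(T) = \mu^{\star}\cdot{T} - \sum_{t=1}^{T}\mu(a_{t}),\qquad \mu^{\star} = \max_{a\in\mathcal{A}}\mu(a)
\end{align*}
$$

An agent that always picks the best arm has zero regret, and every round spent on a suboptimal arm $a$ adds $\mu^{\star}-\mu(a)$.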

## Task 1: Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Otherwise, packages are downloaded and then loaded.

In [19]:
include("Include.jl");

First, let's build the `world(...)` function. 
* This function takes the $m$-dimensional action vector `a::Array{Int64,1}`, whose elements are the indexes of the goods chosen from each category, the `n::Dict{Int,Array{Float64,1}}` dictionary, which holds the amount of each good our agent would select in each category, and a `context::MyBanditConsumerContextModel` instance. It returns the reward (utility) $r\sim\mathcal{D}_{a}$ associated with selecting this action. 
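
To make the indexing convention concrete, here is a minimal sketch of how an action vector selects quantities from the `n` dictionary. The action `a = [3, 4, 1]` is a hypothetical choice for illustration, and the quantities mirror the values used for the algorithm model later in this notebook.

```julia
# Quantity purchased for each alternative, keyed by category
n = Dict{Int, Vector{Float64}}(
    1 => [1.0, 1.0, 1.0],
    2 => [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    3 => [3.0, 3.0, 3.0, 3.0],
);

# Hypothetical action: alternative 3 in category 1, alternative 4 in category 2, alternative 1 in category 3
a = [3, 4, 1];

# a[i] picks the alternative in category i; n[i][a[i]] is the quantity purchased of it
quantities = [n[i][a[i]] for i ∈ eachindex(a)];
println(quantities)
```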

In [20]:
function world(a::Vector{Int64}, n::Dict{Int,Array{Float64,1}}, context::MyBanditConsumerContextModel)::Float64

    # initialize -
    γ = context.γ; # consumer preferences (unknown to bandits)
    σ = context.σ; # noise in utility calculation (unknown to bandits)
    B = context.B; # max budget (unknown to bandits)
    C = context.C; # unit costs of goods (unknown to bandits)
    λ = context.λ; # sensitivity to the budget
    Z = context.Z; # noise model
    ϵ = 0.001; # min unit required
    number_of_categories = context.m; # number of categories

    # compute the reward for this choice -
    Ū = 1.0; # running utility; note that it starts at 1.0, a constant offset on every reward
    BC = 0.0;
    for i ∈ 1:number_of_categories
        
        # what action in category i, did we just take?
        aᵢ = a[i]; # this is which good to purchase in category i -
        nᵢ = max(ϵ, n[i][aᵢ]); # this is how much of good i to purchase (must be geq ϵ)
        Cᵢ = C[i][aᵢ]; # cost of chosen good in category i
        γᵢ = γ[i][aᵢ]; # preference of good in category i
        σᵢ = (σ[i][aᵢ]); # standard dev for good i
        Zᵢ = Z[i]; # noise model
   
        # update the utility -
        Ū += γᵢ*(nᵢ + σᵢ*rand(Zᵢ)); # compute the utility for this good, with noise
        
        # accumulate spending for the budget constraint
        BC += nᵢ*Cᵢ; # running total spend: quantity × unit cost
    end

    # compute the budget constraint violation -
    U = Ū - λ*max(0.0, (BC-B))^2; # use a penalty method to capture budget constraint

    # return the reward -
    return U;
end;
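
The penalty term in the last step of `world(...)` can be examined on its own. The sketch below isolates it in a hypothetical helper named `budget_penalty` (not part of the problem set code) and evaluates it for a few illustrative spend levels.

```julia
# Quadratic penalty for the budget constraint:
# zero when spend ≤ B, and λ⋅(spend - B)² when the budget is exceeded
budget_penalty(spend, B, λ) = λ*max(0.0, spend - B)^2;

B = 100.0; # budget
on_budget = budget_penalty(100.0, B, 100.0);   # spend equals the budget: no penalty
over_budget = budget_penalty(140.0, B, 100.0); # 40 over budget with λ = 100: 100⋅40² = 160000
switched_off = budget_penalty(140.0, B, 0.0);  # λ = 0 removes the constraint entirely
println((on_budget, over_budget, switched_off))
```

Because the penalty grows quadratically, even a modest overspend dominates utilities of order ten when $\lambda$ is large, which pushes the agent toward actions that stay within the budget.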

Next, let's build an algorithm model that we'll use to reason about the world, i.e., a model of our agent's decision-making process. We've modified the $\epsilon$-greedy algorithm to work with our contextual category-based bandit problem. The algorithm will now take into account the context provided by [the `MyBanditConsumerContextModel` model](src/Types.jl) and category-based actions when selecting actions.

#### Epsilon-Greedy with Categories Algorithm
In the _epsilon-greedy_ algorithm, the agent chooses the best arm with probability $1-\epsilon$ and a random arm with probability $\epsilon$. This approach balances exploration and exploitation by allowing the agent to explore different arms while also exploiting the best-known arm based on past rewards. The parameter $\epsilon$ controls the exploration rate: a higher value means more exploration, while a lower value means more exploitation.

* While [Slivkins](https://arxiv.org/abs/1904.07272) doesn't give a reference for the $\epsilon$-greedy algorithm, other sources point (at least in part) to [Thompson and Thompson sampling, proposed in 1933 in the context of drug trials](https://arxiv.org/abs/1707.02038). Thus, the $\epsilon$-greedy algorithm is considered a classic algorithm in the multi-armed bandit literature. The algorithm is simple yet effective, making it a popular choice for many practical applications.

The agent has $K$ arms (choices), $\mathcal{A} = \left\{1,2,\dots,K\right\}$, and the total number of rounds is $T$. The agent uses the following algorithm to choose which arm to pull (which action to take) during each round:

For $t = 1,2,\dots,T$:
1. _Initialize_: Roll a random number $p\in\left[0,1\right]$ and compute a threshold $\epsilon_{t}\sim{t}^{-1/3}$. Note, in other sources, $\epsilon$ is a constant, not a function of $t$.
2. _Exploration_: If $p\leq\epsilon_{t}$, choose a random (uniform) arm $a_{t}\in\mathcal{A}$. Execute the action $a_{t}$ and receive a reward $r_{t}$ from the _adversary_ (nature). 
3. _Exploitation_: Else if $p>\epsilon_{t}$, choose action $a^{\star}_{t}$ (the action with the highest average reward so far, the greedy choice). Execute the action $a^{\star}_{t}$ and receive a reward $r_{t}$ from the _adversary_ (nature).
4. Update the list of rewards for the chosen action $a_{t}\in\mathcal{A}$.

__Theorem__: The epsilon-greedy algorithm with exploration probability $\epsilon_{t}={t^{-1/3}}\cdot\left(K\cdot\log(t)\right)^{1/3}$ achieves a regret bound of $\mathbb{E}\left[R(t)\right]\leq{t}^{2/3}\cdot{O}\left(K\cdot\log(t)\right)^{1/3}$ for each round $t$.
___
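
The steps above can be sketched for a single category as follows. This is a minimal illustration, not the `MyEpsilonGreedyAlgorithmModel` implementation used in this notebook: the function name `epsilon_greedy`, the Bernoulli reward model, and the arm means `μ` are all assumptions made for the example, and the threshold from the theorem is capped at 1 so that it is a valid probability.

```julia
using Random

# Minimal single-category ϵ-greedy sketch with Bernoulli rewards.
# `μ` holds hypothetical true arm means, which are hidden from the agent.
function epsilon_greedy(μ::Vector{Float64}, T::Int; rng = MersenneTwister(1234))
    K = length(μ);
    counts = zeros(Int, K);      # number of pulls per arm
    totals = zeros(Float64, K);  # summed reward per arm
    for t ∈ 1:T
        ϵ_t = min(1.0, (K*log(t)/t)^(1/3)); # exploration threshold, capped at 1
        if rand(rng) ≤ ϵ_t
            a_t = rand(rng, 1:K); # exploration: uniform random arm
        else
            # exploitation: arm with the best average reward so far (unpulled arms score 0)
            averages = [counts[a] > 0 ? totals[a]/counts[a] : 0.0 for a ∈ 1:K];
            a_t = argmax(averages);
        end
        r_t = rand(rng) < μ[a_t] ? 1.0 : 0.0; # Bernoulli reward in {0, 1}
        counts[a_t] += 1; # update the agent's memory
        totals[a_t] += r_t;
    end
    return counts, totals
end

μ = [0.2, 0.5, 0.8]; # hypothetical arm means; arm 3 is the best
counts, totals = epsilon_greedy(μ, 5000);
println("pulls per arm: ", counts)
```

With a decaying $\epsilon_{t}$, most early rounds are exploratory and later rounds are mostly greedy, so the pull counts should concentrate on the arm with the highest mean.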

In [21]:
algorithm = let

    # initialize -
    K = Dict{Int64,Int64}(); # arms dictionary
    n = Dict{Int64, Array{Float64,1}}() # items dictionary

    # How many alternatives (arms) do we have in category?
    K[1] = 3; # category 1 has three possible choices
    K[2] = 6; # category 2 has six possible choices
    K[3] = 4; # category 3 has four possible choices

    # how many items would we purchase *if* we choose alternative i in category j?  
    n[1] = [1.0, 1.0, 1.0]; # category 1
    n[2] = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]; # category 2
    n[3] = [3.0, 3.0, 3.0, 3.0]; # category 3
    
    # build model -
    algorithm = build(MyEpsilonGreedyAlgorithmModel, (
        K = K, # arms dictionary
        n = n, # items dictionary
    ));

    # return the algorithm -
    algorithm;
end;

__Constants__: Finally, let's set some constants we'll use in the subsequent tasks. See the comment beside the value for a description of what it is, its permissible values, etc.

In [22]:
T = 10000; # number of rounds for each decision task

## Task 2: Build the Context Models
In this task, we will build several models of the contextual information that will be used to inform the agent's recommendations. These models, which are [instances of the `MyBanditConsumerContextModel` type](src/Types.jl), hold various parameters that will be used in the `world(...)` function that we developed above. 
* _Hmmm. Why use a different model for contextual data_? We use a separate model for the contextual information because it allows us to encapsulate all the relevant parameters and settings in one place. This makes it easier to manage and modify the parameters as needed, without having to change the core logic of the `world(...)` function. Additionally, it allows us to easily pass context models to other parts of our codebase that may need them.
* _What does this represent_? The contextual information in the `MyBanditConsumerContextModel` represents the parameters that will be used to score the utility of the goods chosen by the agent. This includes user sentiment parameters, budget constraints, and other relevant information that will help the agent make informed decisions about which goods to recommend.

Let's build the following contextual models:
* __Case 1: Unlimited budget, uniform positive sentiment__: This model assumes that the consumer has an unlimited budget ($\lambda = 0$) and uniform positive sentiment across all goods. This means that the consumer is equally likely to choose any good in each category, and there are no constraints on the amount of each good that can be selected. 
* __Case 2: Limited budget, positive sentiment__: This model assumes that the consumer has a limited budget ($\lambda>0$) and positive sentiment (but not necessarily uniform) towards goods. This means that the consumer is more likely to choose goods that they have a positive sentiment towards, and there are constraints on the amount of each good that can be selected based on the budget.

In [23]:
simple_no_budget_context = let

    # initialize -
    γ = Dict{Int,Vector{Float64}}(); # consumer preferences (unknown to bandits)
    σ = Dict{Int,Vector{Float64}}(); # noise in utility calculation (unknown to bandits)
    B = 100.0; # max budget (unknown to bandits)
    C = Dict{Int,Vector{Float64}}(); # unit costs of goods (unknown to bandits)
    λ = 0.0; # sensitivity to the budget constraint λ ≥ 0. If zero, then no penalty for budget constraint violation.
    Z = Dict{Int,Normal}(); # noise model
    number_of_categories = 3; # number of categories

    # set the parameters -
    # preferences: If all γ[i] are equal to 1.0, then the bandit will be indifferent to the goods in each category.
    γ[1] = [1.0, 1.0, 1.0]; # category 1 
    γ[2] = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]; # category 2
    γ[3] = [1.0, 1.0, 1.0, 1.0]; # category 3

    # uncertainty
    σ[1] = [0.1, 0.1, 0.1]; # category 1
    σ[2] = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]; # category 2
    σ[3] = [0.1, 0.1, 0.1, 0.1]; # category 3

    # costs
    C[1] = [10.0, 20.0, 30.0]; # category 1
    C[2] = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]; # category 2
    C[3] = [10.0, 20.0, 30.0, 40.0]; # category 3

    # noise model
    Z[1] = Normal(0.0, 1.0); # category 1
    Z[2] = Normal(0.0, 1.0); # category 2
    Z[3] = Normal(0.0, 1.0); # category 3

    # build a context model with the required parameters -
    context = build(MyBanditConsumerContextModel, (
        γ = γ, # consumer preferences (unknown to bandits)
        σ = σ, # noise in utility calculation (unknown to bandits)
        B = B, # max budget (unknown to bandits)
        C = C, # unit costs of goods (unknown to bandits)
        λ = λ, # sensitivity to the budget
        Z = Z, # noise model
        m = number_of_categories
    )); # build the context

    # return 
    context;
end;

In [None]:
simple_with_budget_context = let

    # initialize -
    γ = Dict{Int,Vector{Float64}}(); # consumer preferences (unknown to bandits)
    σ = Dict{Int,Vector{Float64}}(); # noise in utility calculation (unknown to bandits)
    B = 100.0; # max budget (unknown to bandits)
    C = Dict{Int,Vector{Float64}}(); # unit costs of goods (unknown to bandits)
    λ = 100.0; # sensitivity to the budget constraint λ ≥ 0. If zero, then no penalty for budget constraint violation.
    Z = Dict{Int,Normal}(); # noise model
    number_of_categories = 3; # number of categories

    # set the parameters -
    # preferences: here γ increases with the alternative index, so the bandit is no longer indifferent within a category.
    γ[1] = [1.0, 2.0, 3.0]; # category 1 
    γ[2] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; # category 2
    γ[3] = [1.0, 2.0, 3.0, 4.0]; # category 3

    # uncertainty
    σ[1] = [0.1, 0.1, 0.1]; # category 1
    σ[2] = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]; # category 2
    σ[3] = [0.1, 0.1, 0.1, 0.1]; # category 3

    # costs
    C[1] = [10.0, 20.0, 30.0]; # category 1
    C[2] = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]; # category 2
    C[3] = [10.0, 20.0, 30.0, 40.0]; # category 3

    # noise model
    Z[1] = Normal(0.0, 1.0); # category 1
    Z[2] = Normal(0.0, 1.0); # category 2
    Z[3] = Normal(0.0, 1.0); # category 3

    # build a context model with the required parameters -
    context = build(MyBanditConsumerContextModel, (
        γ = γ, # consumer preferences (unknown to bandits)
        σ = σ, # noise in utility calculation (unknown to bandits)
        B = B, # max budget (unknown to bandits)
        C = C, # unit costs of goods (unknown to bandits)
        λ = λ, # sensitivity to the budget
        Z = Z, # noise model
        m = number_of_categories
    )); # build the context

    # return 
    context;
end;

## Task 3: Evaluation of Scenarios
In this task, we'll run different context models to evaluate how well our agent performs under different scenarios. This will help us understand how the choice of context model affects the agent's performance and the regret it incurs over time. 

In all cases, we will use the same agent and bandit algorithm but vary the context model to see how it influences the agent's decisions and performance. We'll use the $\epsilon$-greedy algorithm to estimate an optimal policy for the agent based on the contextual information provided by the `MyBanditConsumerContextModel`.

__What does the modified algorithm produce__? Let's build a table to see the choices that the bandits in each category are making. `Unhide` the code-block below to see how we construct the simulation results table.

In [25]:
 let
    results_case_1 = solve(algorithm, T = T, world = world, context=simple_no_budget_context);
    table(results_case_1, algorithm, simple_no_budget_context) |> df -> pretty_table(df, tf = tf_simple)
 end

  category   action      γᵢ         n   unitcost   cumspend   remaining         U
     Int64    Int64   Int64   Float64    Float64    Float64     Float64   Float64
         1        3       1       1.0       30.0       30.0        70.0       1.0
         2        4       1       2.0       40.0      110.0       -10.0       3.0
         3        1       1       3.0       10.0      140.0       -40.0       6.0


In [30]:
let
    results_case_2 = solve(algorithm, T = T, world = world, context=simple_with_budget_context);
    table(results_case_2, algorithm, simple_with_budget_context) |> df -> pretty_table(df, tf = tf_simple)
end

  category   action      γᵢ         n   unitcost   cumspend   remaining         U
     Int64    Int64   Int64   Float64    Float64    Float64     Float64   Float64
         1        1       1       1.0       10.0       10.0        90.0       1.0
         2        3       3       2.0       30.0       70.0        30.0       7.0
         3        1       1       3.0       10.0      100.0         0.0      10.0
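
As a sanity check on the Case 2 table, the noise-free optimum can be found by brute force because there are only $3\times{6}\times{4} = 72$ joint actions. The sketch below is a minimal check under stated assumptions: the parameters are copied from `simple_with_budget_context` and the algorithm model above, the helper `noise_free_score` is hypothetical, the noise term is dropped, and the constant 1.0 offset in `world(...)` is omitted because it does not change the ranking of actions.

```julia
# Case 2 parameters, copied from the context and algorithm cells above
γ = Dict(1 => [1.0, 2.0, 3.0], 2 => [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3 => [1.0, 2.0, 3.0, 4.0]);
C = Dict(1 => [10.0, 20.0, 30.0], 2 => [10.0, 20.0, 30.0, 40.0, 50.0, 60.0], 3 => [10.0, 20.0, 30.0, 40.0]);
n = Dict(1 => fill(1.0, 3), 2 => fill(2.0, 6), 3 => fill(3.0, 4));
B, λ = 100.0, 100.0;

# Noise-free score of a joint action `a` (one alternative index per category)
function noise_free_score(a)
    U = sum(γ[i][a[i]]*n[i][a[i]] for i ∈ 1:3);     # linear utility
    spend = sum(C[i][a[i]]*n[i][a[i]] for i ∈ 1:3); # total spend
    return U - λ*max(0.0, spend - B)^2              # quadratic budget penalty
end

# Enumerate all joint actions and keep those with the highest score
actions = vec([(a₁, a₂, a₃) for a₁ ∈ 1:3, a₂ ∈ 1:6, a₃ ∈ 1:4]);
scores = map(noise_free_score, actions);
best_score = maximum(scores);
best_actions = actions[scores .== best_score];
println("best score = $(best_score), attained by $(best_actions)")
```

Under these assumptions the best noise-free score is 10.0, and the action `(1, 3, 1)` reported in the table is one of three joint actions that attain it, each spending the budget of 100 exactly.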
