<a href="https://colab.research.google.com/github/wdempsey/AI4Health-Online-Experimentation/blob/main/introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Online learning and experimentation algorithms in mobile health

In [30]:
## Import necessary libraries
import numpy as np
import scipy as sp
from sklearn.linear_model import LinearRegression

# Part 1: Overview of Contextual Bandits

- For each person in a study, let $t=1,\ldots, T$ denote a sequence of decision points.  
- At each decision time $t$,  we observe a state variable $S_t \in \mathbb{R}^p$.  
- After observing the state variable $S_t$, the _agent_ decides to take action $A_t \in \mathcal{A}$.  
- After observing state $S_t$ and taking action $A_t$, the agent receives a reward $R_t$ given by
$$
R_t = r(S_t, A_t) + \epsilon_t
$$
where $r(s,a)$ is a function that maps the state-action pair onto the real line and $\epsilon_t$ is a random error term with $\mathbb{E} [\epsilon_t] = 0$. 
- The triple (context, action, reward) at a sequence of decision points defines a _contextual bandit_ setting.  
- Here, the goal is to choose, at every decision point, the action $a$ that maximizes the expected reward $\mathbb{E}[R_t \mid S_t, A_t=a] = r(S_t, a)$. 
- If we knew the reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, then the optimal action given state $s$ would be
$$
a^\star (s) = \arg\max_{a \in \mathcal{A}} r(s, a)
$$
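
To make the definition of $a^\star(s)$ concrete, here is a minimal sketch that picks the best action by direct enumeration over a small action set. The reward function `r` and the three-element action set are made up purely for illustration; they are not part of the study or of the simulation later in this notebook.

```python
import numpy as np

# Hypothetical known reward function (an assumption for this sketch):
# the reward is highest when the action is closest to the first state entry.
def r(s, a):
    return -(a - s[0]) ** 2

def optimal_action(s, actions=(0, 1, 2)):
    """Return the action in `actions` with the largest reward r(s, a)."""
    rewards = [r(s, a) for a in actions]
    return actions[int(np.argmax(rewards))]

print(optimal_action(np.array([1.8])))
print(optimal_action(np.array([0.2])))
```

Enumeration like this only works when $\mathcal{A}$ is small and finite, which is the case for the binary actions used in the rest of this section.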

### A simple approach:

- Consider $\mathcal{A} =\{0,1\}$
- Randomize treatment $A_t \sim \text{Bern}(p)$ for $t=1,\ldots,T$
- Then for $t>T$, choose the estimated optimal action
$$
\hat A^\star_t = \arg\max_{a \in \mathcal{A}} \hat r(S_t, a)
$$
where $\hat r(s,a)$ is the reward model fit to the batch data collected during $t=1,\ldots,T$.
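
The two phases of this simple approach can be sketched as follows. Everything specific here is an assumption made for illustration: the one-dimensional state, the data-generating function `true_reward`, the noise level, the feature columns `[s, a, s*a]`, and the fixed seed. The only point is the pattern: randomize, fit $\hat r$, then act greedily.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Hypothetical data-generating model (an assumption for this sketch)
def true_reward(s, a):
    return 0.5 * s + 2.0 * s * a

def features(s, a):
    """Stack the columns [s, a, s*a]; accepts scalars or 1-D arrays."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    a = np.atleast_1d(np.asarray(a, dtype=float))
    return np.column_stack((s, a, s * a))

# Phase 1: randomize A_t ~ Bern(p) for t = 1, ..., T
T, p = 500, 0.5
S = rng.normal(size=T)
A = rng.binomial(n=1, p=p, size=T)
R = true_reward(S, A) + rng.normal(scale=0.5, size=T)

# Fit the reward model r_hat on the batch data
r_hat = LinearRegression().fit(features(S, A), R)

# Phase 2: for t > T, act greedily with respect to r_hat
def greedy_action(s):
    preds = [r_hat.predict(features(s, a))[0] for a in (0, 1)]
    return int(np.argmax(preds))

print(greedy_action(1.0), greedy_action(-1.0))
```

Under this assumed model the true advantage is $2s$, so the greedy rule should pick action 1 for clearly positive states and action 0 for clearly negative ones.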

### Linear Contextual Bandit

- Assume that the reward function is linear in a known feature vector:
$$
r(s,a) = x(s,a)^\top \beta 
$$
where $x(s,a) \in \mathbb{R}^{d}$ is a $d$-dimensional summary of the state-action pair and $\beta \in \mathbb{R}^d$ is an unknown parameter.
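
As a small illustration of the linear form, the sketch below builds one possible feature map: the state entries followed by their interactions with a binary action. Both this choice of `x` and the value of `beta` are assumptions chosen for illustration; many other feature maps fit the same framework.

```python
import numpy as np

def x(s, a):
    """Feature map: state main effects, then state-by-action interactions."""
    s = np.asarray(s, dtype=float)
    return np.concatenate((s, a * s))

# Illustrative parameter value (an assumption; in practice beta is unknown)
beta = np.array([1.0, 0.3, 0.5, -0.7])

def r(s, a):
    """Linear reward r(s, a) = x(s, a)^T beta."""
    return x(s, a) @ beta

print(x([2.0, 1.0], 1))
print(r([2.0, 1.0], 0), r([2.0, 1.0], 1))
```

With this feature map, the first half of `beta` describes the reward under action 0 and the second half describes how the reward changes when action 1 is taken.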

In [43]:
# Simulation example
T = 200 # number of steps

## Generate context (normal and binary states)
mu, sigma = 0, 1 # mean and standard deviation
state1 = np.random.normal(mu, sigma, T) # Continuous state
state2 = np.random.binomial(n=1, p = 0.7,size=T) # Binary state
state = np.stack((state1,state2), axis = 1) # Complete state at each time

## Generate actions (MRT with randomization probability 0.5)
action = np.random.binomial(n=1, p = 0.5,size=T) # Binary action

## Generate true reward
def reward(state, action):
  base_reward = state[0] + 0.3*state[1] 
  advantage = 0.5*state[0] - 0.7*state[1]
  return base_reward + advantage * action

y = np.zeros(T)
for t in range(T):
  # Add a scalar noise draw (a length-1 array would have to be cast to a scalar)
  y[t] = reward(state[t,:], action[t]) + np.random.normal(0, 1)


## Triple
triple = np.column_stack((state,action, y))
print("Entries 1 through 9 (0-indexed) of state (2D), action, and reward")
print(triple[1:10,:])
print("\n")

## Build the design matrix with columns [s0, s1, s0*a, s1*a]
X = state
for col in range(2):
  temp = np.multiply(state[:,col],action)
  X = np.column_stack((X, temp))

reg = LinearRegression().fit(X,y)
print("True coefficients using linear model")
print(np.array([1,0.3,0.5,-0.7]))
print("Fitted coefficients using linear model")
print(reg.coef_)


Entries 1 through 9 (0-indexed) of state (2D), action, and reward
[[ 0.4889732   0.          0.          0.0970272 ]
 [-1.65526392  1.          0.         -2.15776888]
 [ 0.88552154  0.          0.          1.29955221]
 [-1.21393632  1.          0.          0.97508805]
 [-1.79251348  1.          0.         -1.62145004]
 [ 1.42910115  0.          1.          1.68040017]
 [-2.05180643  1.          0.         -3.29626183]
 [ 0.2045409   1.          0.          1.70142421]
 [ 2.92629192  1.          0.          1.87006771]]


True coefficients using linear model
[ 1.   0.3  0.5 -0.7]
Fitted coefficients using linear model
[ 1.08134331  0.49970343  0.50450428 -0.64743794]


## Question 1: What aspects of the reward impact decision making?

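- The only feature of the reward that affects the choice between $a = 1$ and $a=0$ is the _advantage function_:
$$
A(s) = r(s,1) - r(s,0)
$$
Action $a=1$ is optimal exactly when $A(s) > 0$; the baseline reward $r(s,0)$ cancels and plays no role.
- In the example above, with $s = (s_0, s_1)$,
$$
A(s) = 0.5 s_{0} - 0.7 s_{1} > 0 \iff s_1 < \frac{0.5}{0.7} s_0
$$
so when $s_1 = 0$ action 1 is optimal if $s_0 > 0$, and when $s_1 = 1$ it is optimal if $s_0 > 1.4$.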
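
The advantage function for the simulation example can be checked numerically. The sketch below repeats the `reward` function from the simulation cell and evaluates $A(s)$ at a few hand-picked states; the helper names `advantage` and `best_action` and the test states are introduced here for illustration only.

```python
import numpy as np

def reward(state, action):
    # Same reward function as in the simulation example
    base_reward = state[0] + 0.3 * state[1]
    advantage = 0.5 * state[0] - 0.7 * state[1]
    return base_reward + advantage * action

def advantage(state):
    """A(s) = r(s, 1) - r(s, 0); the base reward cancels."""
    return reward(state, 1) - reward(state, 0)

def best_action(state):
    """Choose action 1 only when the advantage is strictly positive."""
    return int(advantage(state) > 0)

for s in [(1.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]:
    print(s, round(advantage(s), 2), best_action(s))
```

Note that the coefficients 1 and 0.3 in the base reward never enter the decision; only 0.5 and -0.7 do.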




## Question 2: Suppose you observe this sequence and are told you must make one "final" decision in state $S_t$.  What decision would you make?


Solution:
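
One possible answer, offered as a sketch rather than the intended solution: plug the fitted coefficients into the advantage function and choose $a=1$ when the estimated advantage is positive. The coefficients below are copied from the regression output printed above, assuming the column order `[s0, s1, s0*a, s1*a]` of the design matrix; the function names are hypothetical.

```python
import numpy as np

# Fitted coefficients copied from the regression output above,
# assumed to be ordered as [s0, s1, s0*a, s1*a]
beta_hat = np.array([1.08134331, 0.49970343, 0.50450428, -0.64743794])

def estimated_advantage(state):
    """Plug-in estimate of A(s) = r(s, 1) - r(s, 0).

    The intercept and the main-effect terms cancel in the difference,
    so only the two interaction coefficients are needed.
    """
    s = np.asarray(state, dtype=float)
    return s @ beta_hat[2:]

def final_decision(state):
    """Pick action 1 if the estimated advantage is positive, else action 0."""
    return int(estimated_advantage(state) > 0)

for s in [(0.5, 0), (1.0, 1), (2.0, 1)]:
    print(s, final_decision(s))
```

This rule ignores the uncertainty in the fitted coefficients, so for states where the estimated advantage is close to zero the decision could easily be wrong.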

## Part 1b: Investigating mHealth randomized trial data

In HeartSteps V2, decision points are 6 times per day.  We 
Variables include
- XX
- XX



## Part 1c: Bandits in mHealth

## Part 1d: Going beyond Bandits






# Part 2: Constrained Optimization