<a href="https://colab.research.google.com/github/wdempsey/AI4Health-Online-Experimentation/blob/main/part2_offline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 2: Synthetic HeartSteps and batch RL


In [None]:
## Import necessary libraries
import numpy as np
import scipy as sp
from sklearn.linear_model import LinearRegression

# Part 1: Overview of Contextual Bandits

- For each person in a study, let $t=1,\ldots, T$ denote a sequence of decision points.  
- At each decision time $t$,  we observe a state variable $S_t \in \mathbb{R}^p$.  
- After observing the state variable $S_t$, the _agent_ decides to take action $A_t \in \mathcal{A}$.  
- After observing state $S_t$ and taking action $A_t$, the agent receives a reward $R_t$ given by
$$
R_t = r(S_t, A_t) + \epsilon_t
$$
where $r(s,a)$ is a function that maps the state-action pair onto the real line and $\epsilon_t$ is a random error term with mean zero, i.e., $\mathbb{E} [\epsilon_t] = 0$. 
- The triple (context, action, reward) at a sequence of decision points defines a _contextual bandit_ setting.  
- Here, the goal is to maximize the expected reward at every time point $\mathbb{E}[R_t \mid S_t, A_t=a] = r(S_t, a)$. 
- If we knew the reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, then the optimal action given state $s$ would be
$$
a^\star (s) = \arg\max_{a \in \mathcal{A}} r(s, a)
$$

### A simple approach:

- Consider $\mathcal{A} =\{0,1\}$
- Randomize treatment $A_t \sim \text{Bern}(p)$ for $t=1,\ldots,T$
- Then for $t>T$, just choose
$$
\hat A^\star_t = \arg\max_{a \in \mathcal{A}} \hat r(S_t, a)
$$
where $\hat r(s,a)$ is the model fit using the batch data collected.
- This is equivalent to running a micro-randomized trial (MRT) and then using the data to construct an optimal decision rule based on the regression model.
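The simple approach above can be sketched in a few lines. This is only an illustration under stated assumptions: the batch is simulated, the reward model (a main effect of $s_0$ plus an action effect of $0.5 s_0 - 0.7 s_1$), the helper names `design` and `greedy_action`, and the sample size are all hypothetical choices, not part of the HeartSteps data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical synthetic batch: T decision points, 2-dimensional state.
T, p_rand = 2000, 0.5
S = rng.normal(size=(T, 2))
A = rng.binomial(1, p_rand, size=T)  # A_t ~ Bern(p)

# Assumed reward model (illustration only): the effect of A = 1 is 0.5*s0 - 0.7*s1.
R = 1.0 + S[:, 0] + A * (0.5 * S[:, 0] - 0.7 * S[:, 1]) + rng.normal(scale=0.5, size=T)

def design(S, A):
    """Features for r_hat(s, a): state, action, and state-by-action interactions."""
    A = np.asarray(A, dtype=float).reshape(-1, 1)
    return np.hstack([S, A, A * S])

# Fit r_hat on the randomized batch.
model = LinearRegression().fit(design(S, A), R)

def greedy_action(S_new):
    """Return argmax over a in {0, 1} of r_hat(s, a) for each row of S_new."""
    r0 = model.predict(design(S_new, np.zeros(len(S_new))))
    r1 = model.predict(design(S_new, np.ones(len(S_new))))
    return (r1 > r0).astype(int)

# Exploit phase: choose actions for two new states.
new_states = np.array([[2.0, -2.0], [-2.0, 2.0]])
actions = greedy_action(new_states)
```

The interaction terms `A * S` matter here: without them the fitted model would imply the same best action in every state.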

## Question 1: What aspects of the reward impact decision making?

- The only thing that affects the decision between $a = 1$ and $a = 0$ is the _advantage function_:
$$
A(s) = r(s,1) - r(s,0)
$$
- In the synthetic example from the previous section, action $a = 1$ is preferred when
$$
A(s) = 0.5 s_{0} - 0.7 s_{1} > 0 \iff \frac{0.5}{0.7} s_0 > s_1
$$
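
As a small illustration, the decision rule implied by this advantage function can be written directly. The function names `advantage` and `optimal_action` are hypothetical; the coefficients are the ones stated above.

```python
import numpy as np

def advantage(s):
    """A(s) = r(s, 1) - r(s, 0) for the synthetic example: 0.5*s0 - 0.7*s1."""
    s = np.asarray(s, dtype=float)
    return 0.5 * s[..., 0] - 0.7 * s[..., 1]

def optimal_action(s):
    """Choose a = 1 exactly when the advantage is positive."""
    return (advantage(s) > 0).astype(int)

states = np.array([[1.0, 0.5],   # A(s) = 0.5 - 0.35 > 0
                   [1.0, 1.0]])  # A(s) = 0.5 - 0.70 < 0
chosen = optimal_action(states)
```

Any part of the reward that does not depend on the action cancels in $A(s)$, so it has no effect on `optimal_action`.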




## Question 2: What are the pros of this simple approach?  What are the cons?  

- Why may we not want to use a randomized policy to collect data in mobile health?


Pros (non-exhaustive)
- Simple algorithm
- With sufficient data, it will construct a 'good' policy
- Easy to explain 

Cons (non-exhaustive)
- Exploration is random, so we learn slowly about the state-action space
- The resulting exploit policy may not be optimal
- How do we know that we collected enough data? 

# Part 1b: Investigating mHealth randomized trial data

- In HeartSteps V2, there are 5 decision points per day.  
- An MRT simulator based on HeartSteps V2 has been built in R and is available [here](https://drive.google.com/drive/folders/1rhCWugawTjEnwmagrOPwxNssrgIsnypT?usp=sharing)


The __State variable__ includes
- __ID__: Numeric id taking values between 1-110
- __Day__: Day-in-study (numeric)
- __Decision time__: Numeric indicator of decision time per day (1-5)
- __Dosage/burden__: Pre-defined function of past pushes (walking and anti-sedentary messages) prior to the current decision time. If any message is actually delivered to the user's phone (not just intended to be sent) between times $t$ and $t+1$, e.g., an activity message at time $t$ or an anti-sedentary message between times $t$ and $t+1$, the dosage at time $t+1$ is $X_{t+1} = \lambda \cdot X_{t} + 1$. Otherwise, $X_{t+1} = \lambda \cdot X_{t}$. The value $\lambda = 0.95$ was set based on the analysis of HeartSteps V1.
- __Engagement Indicator__: Binary indicator of whether the number of app screens viewed on the prior day (12am to 11:59pm) is greater than the 40% quantile of the screens collected.
- __Temperature__: Temperature (in degrees Celsius) at the current location
- __Location__: 1 if at a location other than home or work; 0 if at home or work (pre-specified)
- __Variation Indicator__: For each time slot, first calculate the standard deviation of the (possibly imputed) 60-minute step counts over the past 7 days. The variation indicator on study day $d+1$ is 1 if the standard deviation calculated on day $d$ is greater than or equal to the median of the standard deviations up to day $d$ in the study (excluding the warm-up period), where $d > 0$.
- __Pre-treatment Steps__: Log-transformed step count from the tracker in the 30 minutes prior to the current decision time, $\log(y+0.5)$.
- __Square root of steps yesterday__: The square root of the step count from the tracker collected from 12am to 11:59pm on the previous day
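
The dosage recursion in the list above can be sketched as follows. This is a minimal illustration rather than the simulator's code: the function names, the starting value `x0 = 0.0`, and the delivery pattern are assumptions.

```python
def update_dosage(x_prev, delivered, lam=0.95):
    """One step of X_{t+1} = lam * X_t + 1 if a message was delivered, else lam * X_t."""
    return lam * x_prev + (1.0 if delivered else 0.0)

def dosage_path(delivered_flags, x0=0.0, lam=0.95):
    """Dosage at each decision time, starting from x0 (assumed 0 at study start)."""
    path = [x0]
    for delivered in delivered_flags:
        path.append(update_dosage(path[-1], delivered, lam))
    return path

# Hypothetical pattern: no deliveries after the first two decision times,
# then a delivery after each of the next two.
path = dosage_path([False, False, True, True])
# path is approximately [0.0, 0.0, 0.0, 1.0, 1.95]
```

Because $\lambda < 1$, the contribution of an old message decays geometrically, so the dosage stays bounded by $1/(1-\lambda) = 20$ even if a message is delivered at every step.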

Below we show how to load the MRT data into Python from Google Drive using [these instructions](https://colab.research.google.com/drive/1cMmtzM7rYc-cpW0fkRiTRb-ySr2UHf1h#scrollTo=XTFHRtl68d40).



In [1]:
import collections

import glob

# Importing drive method from colab for accessing google drive
from google.colab import drive

## Importing Dataset from Google Drive

In [2]:
# Mounting drive
# This will require authentication : Follow the steps as guided
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd

HS_MRT_data = pd.read_csv("/content/drive/My Drive/ai4health/HS_MRT_example.csv")
HS_MRT_data.head()

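| Unnamed: 0 | id | day | decision.time | dosage | engagement | other.location | variation | temperature | logpresteps | sqrt.totalsteps | prior.anti | MRT_avails | MRT_probs | MRT_action | MRT_reward |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0.0 | 1 | 1 | 0 | 0.644681 | 0.655219 | 0.492962 | 0 | 0 | 0.5 | 0 | 5.492419 |
| 1 | 1 | 1 | 2 | 0.0 | 1 | 1 | 0 | 0.765377 | -0.693147 | 0.492962 | 1 | 1 | 0.5 | 0 | -0.693147 |
| 2 | 1 | 1 | 3 | 0.0 | 0 | 1 | 1 | 0.808704 | 0.48637 | 0.492962 | 0 | 1 | 0.5 | 1 | 6.006475 |
| 3 | 1 | 1 | 4 | 1.0 | 1 | 1 | 0 | 0.827079 | 0.508638 | 0.492962 | 0 | 0 | 0.5 | 1 | -0.693147 |
| 4 | 1 | 1 | 5 | 1.95 | 0 | 1 | 0 | 0.725725 | 0.658955 | 0.492962 | 0 | 0 | 0.5 | 0 | 6.083615 |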


# Part 1c: Going beyond regression (Batch V-Learning)

- The above presupposes that our goal is to construct a policy that maximizes the proximal outcome at each decision time.
- However, decisions at one time may impact the state at a future time
- __Question__: What are some examples of actions having impacts on future states? 

- In this case, we may want to maximize a different objective function.
- Here, we consider maximizing the state-value function,
$$
V(\pi, s) = \sum_{k \geq 0} \gamma^k \mathbb{E}_\pi \left[ R_{t+k} \mid S_t = s \right]
$$
where $\mathbb{E}_{\pi}$ denotes the expectation when actions are selected according to policy $\pi$ and $\gamma \in [0,1)$ is a discount factor.
- However, we collected data under an MRT policy, $\mu$, so we need to re-express the value function in terms of expectations under $\mu$:
\begin{align*}
V(\pi , s) &= \sum_{k \geq 0} \mathbb{E}_\mu \left[ \gamma^k R_{t+k} \left\{ \prod_{v=0}^k \frac{\pi(A_{v+t}; S_{v+t})}{\mu(A_{v+t}; S_{v+t})} \right\} \mid S_t = s \right]  \\
&= \mathbb{E}_\mu \left[ \frac{\pi(A_{t}; S_{t})}{\mu(A_{t}; S_{t})} \left( R_t + \gamma \sum_{k \geq 0} \mathbb{E}_\mu \left[ \gamma^k R_{t+k+1} \left\{ \prod_{v=0}^k \frac{\pi(A_{v+t+1}; S_{v+t+1})}{\mu(A_{v+t+1}; S_{v+t+1})} \right\} \mid S_{t+1} \right] \right) \mid S_t = s \right] \\
&= \mathbb{E}_\mu \left[ \frac{\pi(A_{t}; S_{t})}{\mu(A_{t}; S_{t})} \left( R_t + \gamma V(\pi, S_{t+1}) \right) \mid S_t = s \right]
\end{align*}
Since $\mathbb{E}_\mu \left[ \pi(A_t; S_t) / \mu(A_t; S_t) \mid S_t \right] = 1$, this implies that
$$
0 = \mathbb{E}_\mu \left[ \frac{\pi(A_{t}; S_{t})}{\mu(A_{t}; S_{t})} \left( R_t + \gamma V(\pi, S_{t+1}) - V(\pi, S_t) \right) \mid S_t \right]
$$
In particular, for any function $\psi$ defined on the domain of $S$, the state-value function satisfies
$$
0 = \mathbb{E}_\mu \left[ \frac{\pi(A_{t}; S_{t})}{\mu(A_{t}; S_{t})} \left( R_t + \gamma V(\pi, S_{t+1}) - V(\pi, S_t) \right) \psi (S_t) \mid S_t \right]
$$
This is an importance-weighted variant of the __Bellman equation__ for the policy $\pi$.  For parametrized state-value functions $V(\pi, s; \theta)$, one natural choice is $\psi(s) = \nabla_\theta V(\pi, s; \theta)$.
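
To see what the ratio $\pi / \mu$ is doing, here is a one-step numerical check on simulated data. Everything in it is an assumption made for illustration: the MRT randomizes with probability 0.5, the target policy picks action 1 with probability 0.8, and the reward is $2A$ plus noise. It is a sketch of importance weighting in general, not of the full V-learning procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

mu1 = 0.5   # assumed MRT probability of action 1
pi1 = 0.8   # hypothetical target-policy probability of action 1

# Data are collected under mu.
A = rng.binomial(1, mu1, size=n)
R = 2.0 * A + rng.normal(size=n)   # assumed reward model: E[R | A] = 2A

# Importance ratio pi(A) / mu(A) for the action actually taken.
mu_prob = np.where(A == 1, mu1, 1 - mu1)
pi_prob = np.where(A == 1, pi1, 1 - pi1)
w = pi_prob / mu_prob

mean_weight = w.mean()        # close to 1, since E_mu[pi / mu] = 1
ipw_value = np.mean(w * R)    # close to E_pi[R] = 2 * 0.8 = 1.6
naive_value = R.mean()        # close to E_mu[R] = 2 * 0.5 = 1.0
```

The unweighted average estimates the reward under the data-collection policy, while the weighted average targets the reward under $\pi$, which is the reason for the change of measure in the derivation above.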

- Suppose we estimate $\hat \theta$; then we can plug this in and define the estimated value of $\pi$,
$$
\hat V_{n,\mathcal{R}} (\pi) = \int V(\pi, s; \hat \theta) d\mathcal{R}(s)
$$
where $\mathcal{R}$ is a reference distribution (typically a distribution over initial states).
- Then the estimated optimal regime is
$$
\hat \pi_{opt} = \arg \max_{\pi \in \Pi} \hat V_{n,\mathcal{R}} (\pi)
$$

- Suppose that the state-value function is parametrized linearly as $V(\pi, s; \theta^\pi) = \Phi (s)^\prime \theta^\pi$ for a feature vector $\Phi(s)$; then define 
$$
\Lambda_n (\pi, \theta^{\pi}) = \left[ n^{-1} \sum_{i=1}^n \sum_{t=1}^{T_i} \frac{\pi(A_{i,t}; S_{i,t})}{\mu(A_{i,t}; S_{i,t})} \left( \gamma \Phi(S_{i,t}) \Phi(S_{i,t+1})^\prime - \Phi(S_{i,t}) \Phi(S_{i,t})^\prime\right) \right] \theta^\pi + n^{-1}  \sum_{i=1}^n \sum_{t=1}^{T_i} \left[\frac{\pi(A_{i,t}; S_{i,t})}{\mu(A_{i,t}; S_{i,t})} R_{i,t} \Phi (S_{i,t}) \right]
$$
and we can estimate
$$
\hat \theta_n^\pi = \arg \min_{\theta^\pi \in \Theta} \left[ \Lambda_n (\pi, \theta^{\pi})^\prime \Lambda_n (\pi, \theta^{\pi}) + \lambda_n (\theta^{\pi})^\prime \theta^{\pi} \right]
$$
where $\lambda_n$ is a tuning parameter.
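
Because $\Lambda_n(\pi, \theta^\pi)$ is linear in $\theta^\pi$, say $\Lambda_n = M \theta^\pi + b$, the penalized criterion has the closed-form minimizer $\hat\theta_n^\pi = -(M^\prime M + \lambda_n I)^{-1} M^\prime b$. The sketch below applies this to a toy problem where the answer is known. The toy environment (two states, next state equal to the action taken, reward equal to the current state), the one-hot features, the target policy "always choose action 1", and the function name `fit_theta` are all assumptions made for illustration.

```python
import numpy as np

def fit_theta(Phi, Phi_next, R, w, gamma, lam, n):
    """Minimize ||M theta + b||^2 + lam * ||theta||^2, where Lambda_n = M theta + b."""
    # M = n^{-1} sum_t w_t * Phi_t (gamma * Phi_{t+1} - Phi_t)'
    M = (Phi * w[:, None]).T @ (gamma * Phi_next - Phi) / n
    # b = n^{-1} sum_t w_t * R_t * Phi_t
    b = (Phi * (w * R)[:, None]).sum(axis=0) / n
    q = Phi.shape[1]
    return np.linalg.solve(M.T @ M + lam * np.eye(q), -M.T @ b)

# Toy data collected under an MRT with mu = 0.5 (hypothetical environment).
rng = np.random.default_rng(0)
n, T, gamma = 50, 20, 0.5
A = rng.binomial(1, 0.5, size=(n, T))
S = np.empty((n, T + 1), dtype=int)
S[:, 0] = rng.integers(0, 2, size=n)
S[:, 1:] = A                      # next state equals the action just taken
R = S[:, :-1].astype(float)       # reward equals the current state

# One-hot features Phi(s), stacked over people and decision times.
Phi = np.eye(2)[S[:, :-1].ravel()]
Phi_next = np.eye(2)[S[:, 1:].ravel()]

# Target policy pi: always choose action 1, so pi / mu is 2 when A = 1 and 0 otherwise.
w = np.where(A.ravel() == 1, 1.0, 0.0) / 0.5

# lam = 0 here so the toy solution is exact; in practice lam is a tuning parameter.
theta = fit_theta(Phi, Phi_next, R.ravel(), w, gamma, lam=0.0, n=n)

# Plug-in value, using the empirical distribution of initial states as the reference.
V_hat = float((np.eye(2)[S[:, 0]] @ theta).mean())
```

In this toy problem, always choosing action 1 moves the chain to state 1 and keeps it there, so $V(\pi, 1) = 1/(1-\gamma) = 2$ and $V(\pi, 0) = \gamma V(\pi, 1) = 1$; `theta` should match these values up to floating-point error. Repeating the fit for each candidate $\pi$ in a policy class and keeping the one with the largest `V_hat` gives one way to approximate the estimated optimal regime.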

