## **RADI608: Data Mining and Machine Learning**

### Assignment: Reinforcement Learning
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

Note: In case of Python Markdown errors, you may access the assignment through this GitHub [Link](https://github.com/rrwabina/RADI608/tree/main/Submitted)

### <code>Question 1: Find one journal article (2021–present) related to reinforcement learning used in healthcare, then describe its methodology and draw a research framework. </code>

## [Strategising Template-guided Needle Placement for MR-targeted Prostate Biopsy](https://arxiv.org/abs/2207.10784)

```
Gayo, I. J., Saeed, S. U., Barratt, D. C., Clarkson, M. J., & Hu, Y. (2022). Strategising template-guided needle placement for MR-targeted prostate biopsy. In MICCAI Workshop on Cancer Prevention through Early Detection (pp. 149-158). Springer, Cham.
```

## 1. Introduction

### 1.1 Motivations and Objectives
Existing biopsy procedures for prostate cancer are limited by their dependence on the operator's skill and experience in sampling the lesions found in pre-operative magnetic resonance (MR) images. Targeting these lesions during ultrasound-guided biopsy demands considerable expertise from the physician, and errors may produce false positive and false negative detections in prostate cancer patients. To address these problems, physicians use multiparametric MR imaging (mpMRI) to localize suspected prostate cancer noninvasively and to identify targets for needle sampling. However, needle sampling of the MR-identified targets remains a challenging task, since the physician's expertise is a significant predictor of detecting clinically significant prostate cancer. 

Recent developments in mpMRI-guided biopsy use only MR images to create planning strategies that support the physician's manual navigation during biopsy, through segmentation, lesion detection, and navigation optimization using deep learning techniques. However, these methods have been shown to yield insufficient and incomplete sampling of heterogeneous cancer lesions and inferior diagnostic accuracy in terms of cancer-representative grading. 

Several studies have confirmed that planned needle deployment in other pre-operative procedures provides better performance. However, no computer-assisted needle sampling strategy has so far optimized the patient- and lesion-specific needle distribution. Therefore, Gayo et al. (2022) investigated the feasibility of using Reinforcement Learning (RL) to plan patient-specific needle sampling strategies for prostate cancer biopsy procedures. 


## 2. Methodology

The <code>agent-environment</code> interactions are modelled as a Markov Decision Process (MDP), described as a 4-tuple $(S, A, r, p)$, where $S$, $A$, $r$, and $p$ refer to the states, actions, reward function, and state transition probability, respectively. Gayo et al. (2022) developed an environment for template-guided biopsy sampling of the cancer targets, defined the MDP components, and specified the policy learning strategy.

### 2.1 Markov Decision Process (MDP) components

<code>State</code>: At each time point $t$ during the procedure, the agent receives information about its current state $s_t \in S$ from the environment. The state encodes the current template grid position, which is determined by the previous action and is the input to policy evaluation. 

<code>Action</code>: The agent takes actions $a_t \in A$ that change its position on the template grid. These actions are relative to the agent's current position $(i, j)$ on the template grid and are defined as $a_t = (\delta_i, \delta_j)$, such that the new position is given by $(i+\delta_i, j+\delta_j)$ where $\delta_i, \delta_j \in [-15, 15]$. The authors modelled the virtual biopsy needles on the image plane, with an insertion depth such that the needle center coincides with the center of the observed 2D target.
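The relative position update can be illustrated with a minimal sketch. The function name `apply_action` and the choice to clip out-of-range offsets to $[-15, 15]$ are illustrative assumptions, not the authors' implementation; how the paper handles positions that fall off the template is not covered here.

```python
DELTA_MAX = 15  # offsets are bounded: delta_i, delta_j in [-15, 15]


def apply_action(position, action, delta_max=DELTA_MAX):
    """Return the new grid position (i + delta_i, j + delta_j).

    Assumption: offsets outside [-delta_max, delta_max] are clipped
    rather than rejected.
    """
    i, j = position
    delta_i = max(-delta_max, min(delta_max, action[0]))
    delta_j = max(-delta_max, min(delta_max, action[1]))
    return (i + delta_i, j + delta_j)


# Example: move 2 positions along i and 3 positions back along j.
new_position = apply_action((6, 6), (2, -3))
```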

<code>Rewards</code>: The reward at time $t$ during training is formulated as $R_t = r(s_t, a_t)$. **The agent is rewarded positively if the virtual biopsy needles obtain lesion samples.** Gayo et al. (2022) use a high reward of <code>+5</code>, which leads to faster convergence during training and encourages the agent to hit the lesions sooner. However, **a penalty of <code>-1</code> is given when the chosen grid positions from the template grid are outside of the prostate.** This penalty discourages the agent from hitting the critical structures that surround the lesion and the other prostate tissue. Reward shaping is also adopted to guide the agent towards the lesion: the reward is the sign function $\text{Sgn}$ of the difference between $\text{dist}_{t-1}$ and $\text{dist}_{t}$, where $\text{dist}_{t}$ represents the Euclidean distance between the target and the needle trajectory at time $t$. 
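The three reward cases can be written as one small function. This is a sketch under stated assumptions: the function and argument names are hypothetical, the hit and outside-prostate flags are assumed to be computed elsewhere by the environment, and the shaping term compares the needle-to-target distance before and after the move.

```python
HIT_REWARD = 5.0        # needle samples the lesion
OUTSIDE_PENALTY = -1.0  # chosen grid position lies outside the prostate


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def reward(hit_lesion, outside_prostate, dist_before, dist_after):
    """Reward for one needle placement.

    Assumption: the cases are checked in this order (hit, outside, shaping).
    """
    if hit_lesion:
        return HIT_REWARD
    if outside_prostate:
        return OUTSIDE_PENALTY
    # Shaping: +1 if the needle moved closer to the target, -1 if farther.
    return float(sign(dist_before - dist_after))
```

With shaping, a move that brings the needle from 12 mm to 7 mm away from the target earns $+1$ even though it misses, so the agent receives a learning signal before its first hit.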

### 2.2 Patient-specific prostate MR-derived Biopsy Environment

<code>Environment</code>: Gayo et al. (2022) developed an environment for template-guided biopsy sampling of the cancer targets. The environment is a **2D slice of an MR-derived** biopsy environment where virtual biopsy needles can be inserted through the perineum via a brachytherapy template grid consisting of $13 \times 13$ holes that are 5 mm apart. 
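The geometry of the template grid is simple enough to sketch directly. The code below only illustrates the $13 \times 13$ layout with 5 mm spacing; placing the origin at the central hole is an assumption made for convenience, and `template_grid` is a hypothetical helper name.

```python
N_HOLES = 13      # holes per side of the template
SPACING_MM = 5.0  # distance between neighbouring holes


def template_grid(n_holes=N_HOLES, spacing_mm=SPACING_MM):
    """Return a nested list where grid[i][j] is the (x, y) position in mm
    of hole (i, j). Assumption: the origin is the central hole."""
    half = (n_holes - 1) / 2
    return [
        [((i - half) * spacing_mm, (j - half) * spacing_mm)
         for j in range(n_holes)]
        for i in range(n_holes)
    ]


grid = template_grid()
```

Under this convention the template spans 60 mm per side, from $-30$ mm to $+30$ mm along each axis.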

The following points are certain considerations the authors assumed in designing and constructing the adopted biopsy environment.

1. The prostate gland from each MR volume, the MR-identified targets, and key landmarks such as the position of the rectum are all **manually segmented from individual patients to construct the biopsy environment**. 
2. **Binary segmentations are provided as observations for the RL agents**. 
3. Uncertainty in MR-to-ultrasound registration can be added to the segmented regions, together with other potential errors in localizing these regions during observation, such as observer variability in manual segmentation. 

<center>
<img src = "figures/environment.JPG" width = "1450"/> <br>
</center>

The environment receives actions from the agent, which are sampled from a policy optimized within the policy evaluation framework discussed in Section 2.3. The environment uses the chosen action to insert the virtual biopsy needles through the template grid and evaluates the corresponding reward. Once the action has been rewarded, the environment initializes new random starting positions on the template grid and sends them to the agent for the next policy evaluation. 

### 2.3 Agent

The agent's goal is to maximize the expected reward it receives from the environment. To do so, the agent needs a sampling strategy, that is, a policy that gives a probability distribution over the actions that can be executed in each state: when in state $s$, it samples an action $a$ according to the distribution $\pi(s, \cdot)$, and repeats. In this case, the sampling strategy is parameterised by **ResNet18**, which serves as the policy neural network $\pi_\theta$ with parameters $\theta$. ResNet18 is a convolutional neural network (CNN) that is 18 layers deep and is primarily used for image classification. In this study, Gayo et al. (2022) used ResNet18 to quantify the probability of an action $a_t$ given state $s_t$. The actions of the agent are sampled from the policy, $a_t \sim \pi_\theta(\cdot |s_t)$. The quantity to be maximized is the accumulated reward, given as

$$Q^{\pi_\theta}(s_t, a_t) = \sum_{k=0}^{T}\gamma^k R_{t+k}$$

where $\gamma = 0.9$ is the discount factor. The accumulated reward is used in policy evaluation to determine how good a particular policy is. Policy evaluation starts with arbitrary values for each state and then iteratively updates the values based on the Bellman equations until the values converge. With continuous actions, the policy is improved by optimizing the parameters using Policy Gradient / Actor-Critic (PG/AC) algorithms, denoted as follows:

$$ \pi_{\theta^*} = \text{arg max}_{\theta} \mathbf{E}_{\pi_\theta} [Q^{\pi_\theta}(s_t, a_t)]$$

The policy improvement equation above is one half of policy iteration, which starts with a random policy and alternates policy evaluation and policy improvement until an optimal policy is obtained. It is, however, slow, because a policy evaluation loop runs within each pass of the policy iteration loop.
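The accumulated reward $Q^{\pi_\theta}$ defined earlier in this section can be computed for a single recorded episode as sketched below. The function name `discounted_return` is hypothetical, and the reward sequences are made-up examples rather than values from the paper.

```python
GAMMA = 0.9  # discount factor stated in the text


def discounted_return(rewards, gamma=GAMMA):
    """Return sum_k gamma**k * rewards[k] for one episode.

    The sum is accumulated backwards, so each step is one multiply-add.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Two missed steps (reward 0) followed by a lesion hit (reward +5):
# 0 + 0.9 * 0 + 0.9**2 * 5 = 4.05
value = discounted_return([0.0, 0.0, 5.0])
```

Because $\gamma < 1$, the same hit is worth less the later it occurs, which is one reason the agent is pushed to reach the lesion in few steps.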

<center>
<img src = "figures/agentframework.JPG" width = "950"/> <br>
</center>

### 2.4 Agent-Environment Interactions


Gayo et al. (2022) utilized T2-weighted MR images, which were manually segmented from 54 prostate cancer patients. These datasets were obtained from the PROMIS and SmartTarget clinical trials. Once the dataset has been preprocessed and segmented, the 2D slices were utilized as environments where virtual biopsy needles are inserted through the perineum via a template grid consisting of $13 \times 13$ holes that are 5 mm apart. The environment then chooses random starting positions in the template grid to locate the targets (i.e., prostate lesions).

To locate the targets, the environment sends the random starting positions on the template grid to the agent. The agent was trained separately for each patient for 120,000 episodes using the Stable Baselines implementation of Proximal Policy Optimization (PPO). Each episode was limited to a maximum of 15 time steps but could terminate early if five needles hit the lesion. Once the agent has obtained an optimal policy through repeated policy evaluation and improvement, it sends actions sampled from that policy. Each action changes the position on the template grid relative to the agent's current position.
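For readers unfamiliar with PPO, its core idea is a clipped surrogate objective that limits how far one update can move the policy. The sketch below computes that objective for a small batch in plain Python. It is not taken from the paper or from Stable Baselines: the function name is hypothetical, and the clipping range `epsilon = 0.2` is a commonly used default, assumed here because the value used by the authors is not stated in this summary.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch.

    ratios:     pi_new(a|s) / pi_old(a|s) for each sample
    advantages: advantage estimate for each sample
    epsilon:    clipping range (assumed value, see lead-in)
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)
```

For a sample with positive advantage, raising the probability ratio beyond $1 + \epsilon$ brings no further gain in the objective, which keeps each policy update conservative.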

The environment then receives the actions from the agent and inserts virtual biopsy needles through the template grid at the position chosen by the agent. The agent is rewarded with $+5$ if the biopsy needle hits the lesion. However, a penalty of $-1$ is given when the agent chooses grid positions outside the prostate. Otherwise, its reward is determined by $\text{Sgn} (\text{dist}_{t-1} - \text{dist}_{t})$. Once the environment has rewarded the agent, it again initializes random positions on the grid to be evaluated by the agent.

Generally, at each step, the agent outputs an action, which is input to the environment. The environment evolves according to its dynamics, the agent observes the new state of the environment and (optionally) a reward, and the process continues until, ideally, the agent learns which behavior maximizes its reward.
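The interaction loop, including the 15-step limit and the early stop after five hits, can be sketched with a toy stand-in for the environment. Everything here is hypothetical and heavily simplified: `ToyBiopsyEnv` uses a single lesion hole on a $13 \times 13$ grid instead of MR-derived segmentations, the penalty and shaping terms are omitted, and `oracle_policy` assumes the lesion position is known, which the real agent must learn.

```python
import random

MAX_STEPS = 15    # episode length limit
HITS_TO_STOP = 5  # early termination after five needle hits


class ToyBiopsyEnv:
    """Hypothetical toy environment, not the authors' implementation."""

    def __init__(self, lesion=(8, 8), size=13, seed=0):
        self.lesion, self.size = lesion, size
        self.rng = random.Random(seed)

    def reset(self):
        # Random starting position on the template grid.
        self.pos = (self.rng.randrange(self.size),
                    self.rng.randrange(self.size))
        self.hits, self.steps = 0, 0
        return self.pos

    def step(self, action):
        # Actions are offsets relative to the current position.
        self.pos = (self.pos[0] + action[0], self.pos[1] + action[1])
        self.steps += 1
        hit = self.pos == self.lesion
        self.hits += int(hit)
        reward = 5.0 if hit else 0.0  # penalty and shaping omitted
        done = self.hits >= HITS_TO_STOP or self.steps >= MAX_STEPS
        return self.pos, reward, done


def oracle_policy(state, lesion=(8, 8)):
    """Placeholder policy that jumps straight to a known lesion."""
    return (lesion[0] - state[0], lesion[1] - state[1])


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps)."""
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total, env.steps


total_reward, n_steps = run_episode(ToyBiopsyEnv(), oracle_policy)
```

With the oracle policy the episode ends after five consecutive hits, while a policy that never reaches the lesion runs for the full 15 steps.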

<center>
<img src = "figures/modelframework.JPG" width = "1450"/> <br>
</center>

<center>
<img src = "figures/step.JPG" width = "1450"/> <br>
</center>