<a href="https://colab.research.google.com/github/shengy90/reinforcement-learning-an-introduction/blob/master/Chapter_5_Monte_Carlo_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5. Monte Carlo Methods
- Monte Carlo methods are a class of methods that do not assume complete knowledge of the environment
- they require only experience, i.e. sample sequences of states, actions and rewards from actual or simulated interaction with an environment
- they require no prior knowledge of the environment's dynamics
- Monte Carlo methods sample and average returns for each state-action pair
- unlike DP, where we *computed* value functions from a model, Monte Carlo methods *learn* value functions from sample returns
- however, like DP, it's still an iterative process of 'policy evaluation' and 'policy improvement'

# 5.1 Monte Carlo Prediction

**Basic principle of Monte Carlo prediction**
- recall that the value of a state is just the expected return starting from that state
- so with enough episodes, simply averaging the returns observed after visits to the state should converge to the expected value!
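
Written out, the principle is a sample average standing in for an expectation. The notation below is only a sketch: $G_i$ (the return observed after the $i$-th visit to $s$) and $n$ (the number of such visits) are labels introduced here for illustration.

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
$$

$$
v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \approx \frac{1}{n}\sum_{i=1}^{n} G_i
$$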

**Example:**
- suppose we wish to estimate $v_\pi(s)$ - the value of state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$ 
- each occurrence of state $s$ in an episode is called a *visit* to $s$
- $s$ may be visited multiple times in the same episode
- the *first-visit MC method* estimates $v_\pi(s)$ as the average of the returns following first visits to $s$
- the *every-visit MC method* estimates $v_\pi(s)$ as the average of the returns following **all** visits to $s$
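
To see the difference between the two methods on a single episode, here is a minimal sketch. The episode (states `A`, `B`, `A`, `C` and their rewards) is made up for illustration, and `returns_per_step` is a hypothetical helper, not something from the book.

```python
def returns_per_step(states, rewards, gamma=1.0):
    """Compute the return G_t for every time step of one episode.

    states[t] is S_t and rewards[t] is R_{t+1} (the reward received
    after leaving S_t). Returns a list of (state, G_t) in time order.
    """
    G = 0.0
    out = []
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(states))):
        G = gamma * G + rewards[t]
        out.append((states[t], G))
    return out[::-1]


# Made-up episode in which state "A" is visited twice (t = 0 and t = 2).
states = ["A", "B", "A", "C"]
rewards = [1.0, 0.0, 2.0, 3.0]

visits = returns_per_step(states, rewards)
returns_from_A = [G for s, G in visits if s == "A"]

# First-visit MC uses only the return after the first visit to "A".
first_visit_estimate = returns_from_A[0]
# Every-visit MC averages the returns after all visits to "A".
every_visit_estimate = sum(returns_from_A) / len(returns_from_A)

print(first_visit_estimate)   # 6.0
print(every_visit_estimate)   # 5.5
```

With a single episode the two estimates differ; over many episodes both are averaged further and approach the same value.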

We'll focus on the first-visit MC method in this chapter.

##### **First-Visit MC Prediction Pseudocode**

`Input`: a policy $\pi$ to be evaluated 

`Initialise`: 
- $V(s) \in \mathbb{R}$ arbitrarily for all $s \in S$
- Returns($s$): an empty list for all $s \in S$

> Loop forever (for each episode):
- Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, \ldots, S_{T-1}, A_{T-1}, R_T$
- Let $G = 0$
- Loop for each step of the episode, $t = T-1, T-2, \ldots, 0$:
    - $G = \gamma G + R_{t+1}$
    - Unless $S_t$ appears in $S_0, S_1, \ldots, S_{t-1}$:
        - Append $G$ to Returns($S_t$)
        - $V(S_t)$ = average(Returns($S_t$))
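
The pseudocode above can be sketched in Python roughly as follows. This is an illustrative sketch, not a reference implementation: the function names, the episode format (a list of `(S_t, R_{t+1})` pairs) and the five-state random walk used as a test environment are all assumptions made here. The "loop forever" is replaced by a fixed number of episodes.

```python
import random
from collections import defaultdict


def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) for the policy that produced the episodes.

    generate_episode() must return a list [(S_0, R_1), (S_1, R_2), ...].
    """
    returns = defaultdict(list)  # Returns(s): all first-visit returns seen so far
    V = defaultdict(float)       # V(s), arbitrarily initialised to 0

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Walk the episode backwards: t = T-1, T-2, ..., 0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # Only record G if this is the first visit to `state`
            if state not in states[:t]:
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return dict(V)


def random_walk_episode(rng, n_states=5):
    """Toy environment (an assumption for this sketch).

    States 1..n_states, start in the middle, move left or right with equal
    probability. Stepping off the right end gives reward 1, off the left
    end reward 0. The policy is implicit in the random moves.
    """
    state = (n_states + 1) // 2
    episode = []
    while 1 <= state <= n_states:
        next_state = state + rng.choice([-1, 1])
        reward = 1.0 if next_state == n_states + 1 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode


rng = random.Random(0)
V = first_visit_mc_prediction(lambda: random_walk_episode(rng), num_episodes=5000)
for s in sorted(V):
    print(s, round(V[s], 3))
```

For this toy walk the true value of state $s$ is $s/6$, so the printed estimates should land close to 0.167, 0.333, 0.5, 0.667 and 0.833. Storing every return in a list mirrors Returns($s$) in the pseudocode; a running mean would give the same averages with less memory.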



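**Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits goes to infinity**. For first-visit MC, each return is an independent and identically distributed estimate of $v_\pi(s)$ with finite variance. By the law of large numbers, the average of these returns therefore converges to the expected value, and the standard deviation of its error falls as $\frac{1}{\sqrt{n}}$, where $n$ is the number of returns averaged. Every-visit MC is less straightforward, because returns from the same episode are not independent, but its estimates also converge to $v_\pi(s)$.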
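
To make the rate concrete, suppose the $n$ returns $G_1, \ldots, G_n$ are i.i.d. with mean $v_\pi(s)$ and finite standard deviation $\sigma$ (the symbol $\sigma$ is introduced here only for illustration). Then:

$$
\operatorname{Var}\left[\frac{1}{n}\sum_{i=1}^{n} G_i\right] = \frac{\sigma^2}{n}
\quad\Rightarrow\quad
\text{standard error} = \frac{\sigma}{\sqrt{n}}
$$

So quadrupling the number of returns halves the typical error: with $\sigma = 2$, for example, $n = 100$ gives a standard error of $0.2$ and $n = 400$ gives $0.1$.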

# 5.2 Monte Carlo Estimation of Action Values

**In situations where a model is not available,** it's useful to estimate action values (values of state-action pairs) rather than state values. **With a model, however,** state values alone are enough to determine a policy: we look one step ahead and pick the action that leads to the best combination of reward and next state.

As mentioned, Monte Carlo methods are super useful when a model of the environment is not available. For this reason, we're more interested in estimating $q_*$. 
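
The difference shows up as soon as we try to act greedily. In the sketch below, picking a greedy action from state values needs the transition model, while picking one from action values needs only a lookup. The tiny one-state problem, the `model` dictionary format and both function names are assumptions made for illustration.

```python
def greedy_from_v(state, actions, model, V, gamma=1.0):
    """Greedy action from state values: requires a one-step lookahead.

    model[(s, a)] is a list of (probability, next_state, reward) tuples.
    """
    def backup(action):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in model[(state, action)])
    return max(actions, key=backup)


def greedy_from_q(state, actions, Q):
    """Greedy action from action values: no model needed."""
    return max(actions, key=lambda action: Q[(state, action)])


actions = ["left", "right"]

# Hypothetical dynamics for a single decision state "s".
model = {
    ("s", "left"): [(1.0, "L", 0.0)],
    ("s", "right"): [(0.5, "R", 1.0), (0.5, "L", 0.0)],
}
V = {"L": 0.0, "R": 2.0}

# The same information, already folded into action values.
Q = {("s", "left"): 0.0, ("s", "right"): 1.5}

print(greedy_from_v("s", actions, model, V))  # right
print(greedy_from_q("s", actions, Q))         # right
```

Both calls agree here because `Q` was filled in by hand from `model` and `V`. Without a model, only the second function can be used.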

**Recall the policy evaluation problem for action values**: we estimate $q_{\pi}(s,a)$, the expected return when starting in state $s$, taking action $a$, and following policy $\pi$ thereafter. The Monte Carlo methods are essentially the same as for state values, except that we now talk about visits to a state-action pair rather than to a state.

A state-action pair `(s,a)` is visited in an episode when state $s$ is visited **and** action $a$ is taken in it. In *first-visit MC*, we average the returns following the first time `(s,a)` was visited in each episode.
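
A minimal sketch of first-visit MC for action values, assuming episodes are given as lists of `(state, action, reward)` triples where the reward is the one received after taking the action. The function name and the two hand-written episodes are made up for illustration.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Average the first-visit returns for every (state, action) pair seen."""
    returns = defaultdict(list)
    for episode in episodes:
        pairs = [(s, a) for s, a, _ in episode]
        G = 0.0
        # Walk backwards so G accumulates the return from step t onwards.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # Record G only on the first visit to (s, a) in this episode.
            if (s, a) not in pairs[:t]:
                returns[(s, a)].append(G)
    return {pair: sum(gs) / len(gs) for pair, gs in returns.items()}


episodes = [
    [("A", "go", 0.0), ("B", "go", 1.0)],
    [("A", "go", 0.0), ("A", "go", 0.0), ("B", "stay", 3.0)],
]

Q = first_visit_mc_q(episodes)
print(Q[("A", "go")])    # 2.0
print(Q[("B", "go")])    # 1.0
print(Q[("B", "stay")])  # 3.0
print(("A", "stay") in Q)  # False
```

The pair `("A", "stay")` never occurs in the data, so it gets no estimate at all.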

**What if many `state-action` pairs were never visited?** 

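If $\pi$ is a deterministic policy, following it yields returns for only one action in each state. With no returns to average, the MC estimates of the other actions will not improve with experience! Recall the trade-off between 'exploitation' and 'exploration' in chapter 2? For policy evaluation to work for action values, we must ensure *continual exploration*.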

For now, let's assume that every episode starts in a state-action pair, and that every pair has a non-zero probability of being selected as the start. As the number of episodes $\to \infty$, all state-action pairs will then be visited infinitely many times (this assumption is called *`exploring starts`*). Later on, we can instead consider only stochastic policies with a non-zero probability of selecting every action in each state.
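
One way to picture exploring starts: choose the first state and first action at random, then hand control back to the policy. The sketch below assumes a hypothetical `step(state, action)` function returning `(next_state, reward, done)` and a tiny three-state chain invented for this example. The deterministic policy always moves right, so on its own it would never try `left`.

```python
import random

STATES = [0, 1, 2]
ACTIONS = ["left", "right"]


def step(state, action):
    """Toy dynamics (an assumption): 'left' ends the episode with reward 0,
    'right' moves one state up and ends with reward 1 after state 2."""
    if action == "left":
        return state, 0.0, True
    next_state = state + 1
    done = next_state == len(STATES)
    return next_state, (1.0 if done else 0.0), done


def episode_with_exploring_start(policy, rng, max_steps=100):
    """Start from a uniformly random (state, action), then follow `policy`."""
    state = rng.choice(STATES)
    action = rng.choice(ACTIONS)
    episode = []
    for _ in range(max_steps):
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state = next_state
        action = policy[state]
    return episode


policy = {s: "right" for s in STATES}  # deterministic: never picks "left"
rng = random.Random(0)
episodes = [episode_with_exploring_start(policy, rng) for _ in range(600)]

# Every (state, action) pair shows up as a starting pair at least once.
starts = {(ep[0][0], ep[0][1]) for ep in episodes}
print(len(starts))
```

Uniform sampling is just one choice; any start distribution that gives every pair a non-zero probability satisfies the assumption.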

# 5.3 Monte Carlo Control

# 5.4 Monte Carlo Control without Exploring Starts

# 5.5 Off-policy Prediction via Importance Sampling

# 5.6 Incremental Implementation

# 5.7 Off-policy Monte Carlo Control

# 5.8 Discounting-aware Importance Sampling

# 5.9 Per-decision Importance Sampling
