# Applying the Monte Carlo (MC) Method to an Environment
- Presenter: 최찬혁

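This notebook applies the Monte Carlo (MC) method to the "FrozenLake-v1" environment, loaded through the `gymnasium` package (the maintained successor of OpenAI Gym).

We use the **every-visit** MC method: every occurrence of a state-action pair in an episode contributes an update.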

## Monte Carlo(MC) method
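The MC method used here is tabular and model-free: it keeps one value per state-action pair and learns from sampled episodes, without a model of the environment's dynamics.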


- Goal : learn $q_{\pi}$ from entire episodes of real experience under policy $\pi$
    - entire trajectory of an episode : $S_0 , A_0 , R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
    
    - return : $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma ^{T-t-1}R_T$ 
    
    - action-value function : $q_{\pi} \left( s, a\right) = \mathbb{E}_{\pi} \left[ G_t | S_t =s, A_t = a\right]$

 
- MC policy evaluation uses the **empirical mean** return in place of the expected return.


- $Q\left( s, a \right) \rightarrow q_{\pi} \left( s, a \right)$ as $n\left( s, a \right) \rightarrow \infty$ by the law of large numbers, under the assumption of i.i.d. returns.
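As a small illustration of the return $G_t$ defined above, the sketch below computes every $G_t$ of one episode in a single backward pass, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The function name `discounted_returns` and the sample reward list are illustrative assumptions and do not come from the notebook itself.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical episode: five steps with reward 0, then reward 1 at the end.
sample_returns = discounted_returns([0, 0, 0, 0, 0, 1], gamma=0.95)
print(sample_returns[0])  # equals 0.95 ** 5, since only the last reward is nonzero
```

The backward pass costs one multiplication and one addition per step, which is why the training loops later in this notebook also iterate over the episode in reverse.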

There are two ways to compute the empirical mean.

For $x_1, x_2, \cdots, x_n$, the empirical mean is $\mu _{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$.

Since we have 
- $x_1 + x_2 + \cdots + x_{n-1} = \left( n-1 \right) \mu _{n-1}$ and
- $x_1 + x_2 + \cdots + x_{n} = n \mu _{n}$,

we obtain $x_n = n\mu_{n} - \left( n-1 \right) \mu _{n-1}$, that is, $n\mu_{n} = \left( n-1 \right) \mu_{n-1} + x_n$, and therefore $\mu_{n} = \mu_{n-1} + \frac{1}{n}\left( x_n - \mu_{n-1}\right)$  (incremental updates).
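To check the incremental form numerically, the following sketch applies $\mu_{n} = \mu_{n-1} + \frac{1}{n}\left( x_n - \mu_{n-1}\right)$ one sample at a time and compares the result with the ordinary batch mean. The function name and the sample values are hypothetical.

```python
def incremental_mean(samples):
    """Running mean via mu_n = mu_{n-1} + (x_n - mu_{n-1}) / n."""
    mu = 0.0
    for n, x in enumerate(samples, start=1):
        mu += (x - mu) / n
    return mu


samples = [1.0, 0.0, 0.0, 1.0, 1.0]  # hypothetical returns
batch_mean = sum(samples) / len(samples)
print(incremental_mean(samples), batch_mean)  # the two values agree up to rounding
```

The incremental form needs only the current mean and the count, so no list of past returns has to be stored.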

Instead of the step size $\frac{1}{n}$, we can use a constant step size $\alpha \in (0, 1]$.
- $\mu_{n} = \mu_{n-1} + \alpha \left( x_n - \mu_{n-1}\right)$ (constant-$\alpha$ updates)

The constant-$\alpha$ update weights recent samples more heavily than older ones.
This is preferable here because the policy keeps changing during training, so returns sampled under recent policies are more relevant than returns sampled under outdated ones.
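To make the weighting explicit: unrolling the constant-$\alpha$ recursion from an initial estimate $\mu_0$ gives $\mu_n = (1-\alpha)^n \mu_0 + \sum_{k=1}^{n} \alpha (1-\alpha)^{n-k} x_k$, so the weight of a sample decays geometrically with its age. The sketch below checks this identity on hypothetical data; the function names and the choice $\alpha = 0.5$ are assumptions made for readability (the notebook itself uses a much smaller step size).

```python
def constant_alpha_mean(samples, alpha, initial=0.0):
    """Apply mu <- mu + alpha * (x - mu) for each sample in order."""
    mu = initial
    for x in samples:
        mu += alpha * (x - mu)
    return mu


def sample_weights(n, alpha):
    """Weight of x_1..x_n after n updates: alpha * (1 - alpha) ** (n - k)."""
    return [alpha * (1 - alpha) ** (n - k) for k in range(1, n + 1)]


samples = [0.0, 0.0, 1.0, 1.0]  # hypothetical returns, oldest first
alpha = 0.5
weights = sample_weights(len(samples), alpha)
estimate = constant_alpha_mean(samples, alpha)
weighted_sum = sum(w * x for w, x in zip(weights, samples))
print(weights)                 # weights grow toward the most recent sample
print(estimate, weighted_sum)  # both forms give the same estimate (initial value is 0)
```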

Likewise, we can update the Q-table with either of the following rules:
- Incremental update:
    - $n(S_t, A_t) \leftarrow n(S_t,A_t) + 1$
    - $Q(S_t, A_t)\leftarrow Q(S_t,A_t) + \frac{1}{n(S_t,A_t)}(G_t - Q(S_t, A_t))$
- Constant-$\alpha$ update:
    - $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha(G_t - Q(S_t, A_t))$

The pseudo-code of the MC method is shown below.

![pseudo code](./MC_Algorithm.JPG)

## Environment(FrozenLake-v1)

This game involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H) by walking over the Frozen(F) lake. The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

The episode terminates when the agent reaches the **Goal(G) or a Hole(H)**.

The reward is $1$ for reaching the **Goal(G)** and $0$ otherwise.

In [1]:
import gymnasium as gym
import numpy as np
import random

When creating the environment, we can pass several options.

- desc : Specifies a custom map for the frozen lake, that is, the positions of Start, Hole, Frozen, and Goal tiles.

- map_name : ID of one of the preloaded maps ("4x4" or "8x8").

- is_slippery (boolean) : If True, the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. If False, it always moves in the intended direction. (For example, if the action is LEFT and is_slippery is True, the agent moves left, up, or down with probability 1/3 each.)
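The slippery rule can be written down as a small probability table. The sketch below only mirrors the description above and does not call the environment; the helper name `slippery_outcomes` is hypothetical. It relies on the action encoding 0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP, in which the two perpendicular directions of an action `a` are `(a - 1) % 4` and `(a + 1) % 4`.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # action encoding of FrozenLake


def slippery_outcomes(action):
    """Map each possible actual move direction to its probability (is_slippery=True)."""
    perpendicular = [(action - 1) % 4, (action + 1) % 4]
    return {direction: 1 / 3 for direction in [action] + perpendicular}


print(slippery_outcomes(LEFT))  # LEFT, UP and DOWN each with probability 1/3
```

The opposite direction (RIGHT when the action is LEFT) never occurs, so a slip can send the agent sideways but not backward.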

![FrozenLake-env-4X4](./Frozen_Lake_v1_4X4.JPG)
![FrozenLake-env-8X8](./Frozen_Lake_v1_8X8.JPG)

In [2]:
env = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=False)
env.reset()

(0, {'prob': 1})

In [3]:
env.observation_space

Discrete(16)

Observation will be an integer in $\left\{ 0, 1, \cdots , 15\right\}$.

This number encodes the agent's location on the grid.

![Location](./observation_space.JPG)

For example, the initial observation is always $0$, and the episode terminates when the observation is $5, 7, 11$ or $12$ (Holes) or $15$ (Goal).
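The observation index follows row-major order on the 4x4 grid, i.e. `state = row * 4 + col`. The sketch below converts an observation back to grid coordinates and flags terminal tiles of the default 4x4 map; the constant and function names are illustrative and not part of the Gymnasium API.

```python
N_COLS = 4
HOLES = {5, 7, 11, 12}  # Hole tiles of the default 4x4 map
GOAL = 15               # Goal tile


def to_row_col(state):
    """Convert an observation index to (row, col), both zero-based."""
    return divmod(state, N_COLS)


def is_terminal(state):
    """True if the episode ends when the agent stands on this tile."""
    return state in HOLES or state == GOAL


print(to_row_col(13))                    # (3, 1)
print(is_terminal(13), is_terminal(15))  # False True
```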

In [4]:
env.action_space

Discrete(4)

An action is an integer in $\left\{ 0, 1, 2, 3\right\}$:

$0$ : LEFT

$1$ : DOWN

$2$ : RIGHT

$3$ : UP

In [5]:
env.close()

## Hyperparameters

In [6]:
epsilon_initial = 1.0
epsilon_decay = 0.999
epsilon_min = 0.05
MAX_EPISODE = 20000
GAMMA = 0.95

# alpha
step_size = 0.02 
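
The training loops below set the exploration rate of each episode to `max(epsilon_decay ** episode, epsilon_min)`. As a rough sketch of what this schedule implies, the snippet finds the first episode at which the floor `epsilon_min` is reached; the helper name `epsilon_at` and the variable `first_floor_episode` are hypothetical.

```python
epsilon_decay = 0.999
epsilon_min = 0.05


def epsilon_at(episode):
    """Exploration rate used in the given episode (0-based)."""
    return max(epsilon_decay ** episode, epsilon_min)


# First episode whose decayed value is no longer above the floor.
first_floor_episode = next(
    ep for ep in range(20000) if epsilon_decay ** ep <= epsilon_min
)
print(first_floor_episode, epsilon_at(0), epsilon_at(first_floor_episode))
```

With these values, exploration decays for roughly the first 3,000 of the 20,000 episodes and stays at 0.05 afterward.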

## Incremental Monte Carlo updates

- $n\left( S, A\right) \leftarrow n\left( S, A\right) + 1$
- $Q\left( S, A\right) \leftarrow Q\left( S, A\right) + \frac{1}{n\left( S, A\right)} \left[ G - Q\left( S, A\right)\right]$

In [21]:
env1 = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=False)
env1.reset()

(0, {'prob': 1})

In [22]:
Q_table1 = np.zeros((env1.observation_space.n, env1.action_space.n))

In [23]:
Q_table1

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [24]:
n_table = np.zeros((env1.observation_space.n, env1.action_space.n))

### $\epsilon$-greedy
Choose the greedy action with probability $1-\epsilon$ and a uniformly random action with probability $\epsilon$ (in the random case, every action is equally likely).

This corresponds to part (c) of the pseudo-code figure above.
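Because the random branch can also pick the greedy action, the overall selection probabilities are $1 - \epsilon + \frac{\epsilon}{|A|}$ for the greedy action and $\frac{\epsilon}{|A|}$ for every other action. The sketch below computes this distribution for one row of a Q-table; the function name and the sample Q-values are hypothetical.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Selection probability of each action under epsilon-greedy."""
    n_actions = len(q_values)
    # First maximal entry, which matches the tie-breaking of np.argmax.
    greedy = max(range(n_actions), key=lambda a: q_values[a])
    probs = [epsilon / n_actions] * n_actions
    probs[greedy] += 1 - epsilon
    return probs


probs = epsilon_greedy_probs([0.0, 0.7, 0.1, 0.0], epsilon=0.2)
print(probs)  # action 1 is greedy and receives most of the probability mass
```

Note that with an all-zero Q-table the greedy action is always action 0 (LEFT), so early exploration comes entirely from the random branch.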

In [25]:
def get_action1(Q_table, state, epsilon):
    tmp = random.random()
    if tmp < epsilon: # random action with probability epsilon
        return np.random.randint(env1.action_space.n)
    # greedy action with probability 1 - epsilon
    return np.argmax(Q_table[state])

In [26]:
for episode in range(MAX_EPISODE):
    states = []
    actions = []
    rewards = []
    
    terminated = False # To start an episode
    epsilon = np.max([epsilon_decay ** episode, epsilon_min])
    
    state,_ = env1.reset()
    
    # Part (a) of the pseudo-code above
    while not terminated: 
        states.append(state)
        action = get_action1(Q_table1, state, epsilon)
        actions.append(action)
        # step() returns (obs, reward, terminated, truncated, info); truncated and info are ignored here
        next_state, reward, terminated,_,_ = env1.step(action)
        rewards.append(reward)
        state = next_state
    
    G = 0
    T = len(states)
    
    # Part (b) of the pseudo-code above
    for t in reversed(range(T)): 
        G = (GAMMA * G) + rewards[t]
        n_table[states[t], actions[t]] = n_table[states[t], actions[t]] + 1
        Q_table1[states[t], actions[t]] = Q_table1[states[t], actions[t]] + ((G - Q_table1[states[t], actions[t]])/n_table[states[t], actions[t]])
        
    if episode % 500 == 0: # print log
        print("episode_num:" + str(episode))
        Q_table_transform1 = np.argmax(Q_table1, axis=1)
        Q_table_transform1 = np.reshape(Q_table_transform1, (4,4))
        print(Q_table_transform1)

episode_num:0
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
episode_num:500
[[1 2 1 2]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:1000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:1500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:2000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:2500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:3000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:3500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:4000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:4500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:5000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:5500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:6000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:6500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:7000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:7500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:

In [27]:
np.reshape(np.argmax(Q_table1, axis=1), (4,4))

array([[1, 2, 1, 0],
       [1, 0, 1, 0],
       [2, 1, 1, 0],
       [0, 2, 2, 0]])

![Result_arrow_form](./Result_1.JPG)

In [28]:
env1.close()

## Constant-$\alpha$ Monte Carlo updates
$Q\left( S, A\right) \leftarrow Q\left( S, A\right) + \alpha \left[ G - Q\left( S, A\right)\right] = \alpha G + \left( 1 - \alpha \right)Q\left( S, A\right)$

In [55]:
env2 = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=False)
env2.reset()

(0, {'prob': 1})

In [56]:
Q_table2 = np.zeros((env2.observation_space.n, env2.action_space.n))

In [57]:
def get_action2(Q_table, state, epsilon):
    tmp = random.random()
    if tmp < epsilon: # random action with probability epsilon
        return np.random.randint(env2.action_space.n)
    # greedy action with probability 1 - epsilon    
    return np.argmax(Q_table[state])

In [58]:
for episode in range(MAX_EPISODE):
    states = []
    actions = []
    rewards = []
    
    terminated = False # To start an episode
    epsilon = np.max([epsilon_decay ** episode, epsilon_min])
    
    state,_ = env2.reset()

    # Part (a) of the Pseudo-code
    while not terminated: 
        states.append(state)
        action = get_action2(Q_table2, state, epsilon)
        actions.append(action)
        next_state, reward, terminated, _, _ = env2.step(action)
        rewards.append(reward)
        state = next_state
    G = 0
    T = len(states)
    
    # Part (b) of the Pseudo-code
    for t in reversed(range(T)): 
        G = (GAMMA * G) + rewards[t]
        Q_table2[states[t], actions[t]] = (step_size * G) + ((1 - step_size) * Q_table2[states[t], actions[t]])
        
    if episode % 500 == 0:
        print("episode_num:" + str(episode))
        Q_table_transform2 = np.argmax(Q_table2, axis=1)
        Q_table_transform2 = np.reshape(Q_table_transform2, (4,4))
        print(Q_table_transform2)

episode_num:0
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
episode_num:500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:1000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:1500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:2000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:2500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:3000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:3500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:4000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:4500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:5000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:5500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:6000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:6500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:7000
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:7500
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
episode_num:

In [59]:
np.reshape(np.argmax(Q_table2, axis=1), (4,4))

array([[2, 2, 1, 0],
       [1, 0, 1, 0],
       [2, 1, 1, 0],
       [0, 2, 2, 0]])

![Result_arrow_form](./Result_2.JPG)
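
The two final policies differ only in the first tile (DOWN for the incremental run, RIGHT for the constant-$\alpha$ run). As a sanity check, the sketch below follows each greedy policy on the non-slippery 4x4 map without calling the environment; the map constants, the move table, and the function `greedy_path` are assumptions written for this illustration.

```python
N = 4
HOLES = {5, 7, 11, 12}
GOAL = 15
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # LEFT, DOWN, RIGHT, UP


def greedy_path(policy, start=0, max_steps=100):
    """Follow a 4x4 greedy policy deterministically; return the visited states."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state in HOLES or state == GOAL:
            break
        row, col = divmod(state, N)
        d_row, d_col = MOVES[policy[row][col]]
        # A move off the grid leaves the agent in place.
        row = min(max(row + d_row, 0), N - 1)
        col = min(max(col + d_col, 0), N - 1)
        state = row * N + col
        path.append(state)
    return path


policy_incremental = [[1, 2, 1, 0], [1, 0, 1, 0], [2, 1, 1, 0], [0, 2, 2, 0]]
policy_constant_alpha = [[2, 2, 1, 0], [1, 0, 1, 0], [2, 1, 1, 0], [0, 2, 2, 0]]
print(greedy_path(policy_incremental))     # [0, 4, 8, 9, 13, 14, 15]
print(greedy_path(policy_constant_alpha))  # [0, 1, 2, 6, 10, 14, 15]
```

Both paths reach the Goal in six moves, the minimum possible on this map, so the different first action reflects a tie between two equally short routes.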

In [60]:
env2.close()