# Monte Carlo Methods in Reinforcement Learning

Monte Carlo methods are a class of algorithms that rely on repeated random sampling to obtain numerical results. In the context of reinforcement learning, Monte Carlo methods are used to estimate the value of states or state-action pairs based on observed episodes.

## Monte Carlo Prediction

Monte Carlo prediction estimates the value function $V(s)$ for a given policy $\pi$ as the average return observed after visiting each state, taken over many episodes. Because the return is only known once an episode ends, these methods apply to tasks that terminate.

### Process:
1. **Generate Episodes**: Run the policy $\pi$ to generate multiple episodes.
2. **Calculate Returns**: For each state in each episode, calculate the return (the cumulative reward, discounted by a factor $\gamma$ if one is used) from that state to the end of the episode.
3. **Update Value Function**: Average the returns for each state over multiple episodes to estimate the value function.
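
Step 2 can be sketched as a single backward pass over the rewards of one episode. This is a minimal illustration, not a definitive implementation: the helper name `discounted_returns` is hypothetical, and the discount factor `gamma` is an assumption (with `gamma=1.0` the result is the plain cumulative reward).

```python
def discounted_returns(rewards, gamma=1.0):
    """Return G_t for every time step t of one episode.

    rewards[t] is the reward received after leaving the state visited at step t.
    gamma is an assumed discount factor; gamma=1.0 gives the undiscounted sum.
    """
    returns = [0.0] * len(rewards)
    G = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Walking backwards lets every return reuse the one computed for the following step, so the whole episode is processed in linear time.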

### Methods:
- **Every-Visit**: Updates the value function using the average of all returns observed each time a state is visited.
- **First-Visit**: Updates the value function using the average of returns observed only the first time a state is visited in each episode.

### Equation:
---

- **Every-Visit**:
>$$V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

- **First-Visit**:
>$$V(s) \approx \frac{1}{N_{\text{first}}(s)} \sum_{i=1}^{N_{\text{first}}(s)} G_i(s)$$


Where:
- $N(s)$ is the total number of visits to state $s$ across all episodes
- $N_{\text{first}}(s)$ is the number of episodes in which state $s$ is visited at least once
- $G_i(s)$ is the return that follows the $i$-th counted visit to state $s$ (every visit for Every-Visit, only the first visit in each episode for First-Visit)
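
The difference between the two estimators shows up as soon as a state repeats within an episode. The sketch below is illustrative only: the function name `mc_state_values`, the `(state, reward)` episode layout, and the default `gamma=1.0` are assumptions made for the example.

```python
from collections import defaultdict


def mc_state_values(episodes, gamma=1.0, first_visit=True):
    """Average returns per state over a list of episodes.

    Each episode is a list of (state, reward) pairs; this layout is an
    assumption made for the sketch.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return G_t for every step with one backward pass
        G = 0.0
        step_returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            step_returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: ignore repeat visits in this episode
            seen.add(state)
            returns[state].append(step_returns[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}


# State "A" is visited twice in the single episode below
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
print(mc_state_values(episodes, first_visit=True))   # {'A': 3.0, 'B': 2.0}
print(mc_state_values(episodes, first_visit=False))  # {'A': 2.5, 'B': 2.0}
```

With First-Visit only the return of 3.0 from the first occurrence of `"A"` is counted, while Every-Visit also averages in the return of 2.0 from the second occurrence.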


## Monte Carlo Control

Monte Carlo control methods search for the optimal policy $\pi^*$ by alternating between estimating action values from sampled episodes and improving the policy. There are two main approaches: on-policy and off-policy learning.

### On-policy Monte Carlo Control

On-policy methods update the policy based on the actions taken by the current policy. The policy is improved iteratively using the estimated action-value function $Q(s, a)$.

### Off-policy Monte Carlo Control

Off-policy methods learn the value of the optimal policy $\pi^*$ while following a different behavior policy $\mu$. This approach uses importance sampling to correct for the difference between the behavior policy and the target policy.


## Off-policy and On-policy Learning

### On-policy Learning
In on-policy learning, the policy used to generate the episodes is the same as the policy being improved. The action-value function $Q(s, a)$ is updated using the returns observed from the episodes generated by the current policy.

### Off-policy Learning
In off-policy learning, the behavior policy $\mu$ is different from the target policy $\pi$. The action-value function $Q(s, a)$ is updated using importance sampling to correct for the difference between the behavior policy and the target policy.

### Importance Sampling
Importance sampling is used to weight the returns observed under the behavior policy $\mu$ so that they estimate the returns for the target policy $\pi$.

### Equation:
---


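>$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \rho \left( G - Q(s, a) \right), \qquad \rho = \prod_{k=t+1}^{T-1} \frac{\pi(a_k|s_k)}{\mu(a_k|s_k)}$$

Where:
- $G$ is the return observed after taking action $a$ in state $s$ at time step $t$
- $\rho$ is the importance sampling ratio, the product of the per-step probability ratios over the actions taken after step $t$ until the episode ends at time $T$
- $\alpha$ is the learning rate.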
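
For a Monte Carlo return that spans several steps, the correction is commonly applied as the product of the per-step ratios over the actions taken after $(s, a)$. The following is a minimal sketch under that assumption; the dict-based policies and the helper names `importance_ratio` and `off_policy_update` are hypothetical.

```python
def importance_ratio(steps, pi, mu):
    """Product of pi(a|s) / mu(a|s) over a list of (state, action) steps.

    pi and mu are assumed to be dicts mapping (state, action) to a probability.
    """
    rho = 1.0
    for state, action in steps:
        rho *= pi[(state, action)] / mu[(state, action)]
    return rho


def off_policy_update(Q, s, a, G, rho, alpha=0.1):
    """Move Q(s, a) toward the return G, scaled by the importance ratio rho."""
    Q[(s, a)] += alpha * rho * (G - Q[(s, a)])
    return Q


# Target policy always picks "b"; behavior policy picks uniformly at random
pi = {(0, "a"): 0.0, (0, "b"): 1.0, (1, "a"): 0.0, (1, "b"): 1.0}
mu = {(0, "a"): 0.5, (0, "b"): 0.5, (1, "a"): 0.5, (1, "b"): 0.5}

later_steps = [(1, "b")]  # steps that followed (s=0, a="b") in the episode
rho = importance_ratio(later_steps, pi, mu)
Q = off_policy_update({(0, "b"): 0.0}, 0, "b", G=1.0, rho=rho)
print(rho, Q)  # 2.0 {(0, 'b'): 0.2}
```

If the behavior policy ever takes an action that the target policy would never take, the ratio becomes zero and that return contributes nothing to the update.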


## Epsilon-Greedy Policy

The epsilon-greedy policy is a method to balance exploration and exploitation. It ensures that the agent explores the environment by choosing random actions with probability $\epsilon$ and exploits the best-known actions with probability $1 - \epsilon$.
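
Because the random draw can also land on the best-known action, the resulting action probabilities can be written out explicitly. The expression below assumes that the random action is drawn uniformly from the action set; the symbols $\mathcal{A}$ (the action set) and $a^*$ (the action with the highest estimated value in state $s$) are notation introduced here for illustration.

>$$\pi(a|s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}|} & \text{if } a = a^* \\ \dfrac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$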

### Epsilon-Greedy Policy Algorithm

1. **Initialization**: Set $\epsilon$ (exploration rate).
2. **Action Selection**:
   - With probability $\epsilon$, choose a random action.
   - With probability $1 - \epsilon$, choose the action that maximizes the estimated value.
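
The two steps above can be sketched as a small selection function. This is one possible formulation, assuming the estimated values for a state are held in a 1-D NumPy array and that exploration draws uniformly over all actions; the name `epsilon_greedy_action` and the sample values are hypothetical.

```python
import numpy as np


def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an action index from a 1-D array of estimated action values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: best-known action


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
q_values = np.array([0.1, 0.7, 0.3])
picks = [epsilon_greedy_action(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # close to 1 - 0.1 + 0.1 / 3
```

The greedy action is chosen slightly more often than $1 - \epsilon$ of the time because the exploratory draw can also select it.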

## Monte Carlo Algorithm

The Monte Carlo algorithm involves generating episodes, calculating returns, and updating value estimates based on empirical averages.

### Pseudocode:

1. Initialize value function $V(s)$ arbitrarily
2. For each episode:
    - Generate episode using policy $\pi$
    - For each state $s$ visited in the episode:
        - Calculate return $G$ following the visit to $s$
        - Update $V(s)$ using empirical average of returns
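
The empirical average in the pseudocode above does not require storing every return: the running-mean update $V(s) \leftarrow V(s) + \frac{1}{N(s)}\left(G - V(s)\right)$ gives the same result. A minimal sketch, where the helper name `update_value` and the sample returns are made up for illustration:

```python
from collections import defaultdict

V = defaultdict(float)  # value estimates, default 0.0
N = defaultdict(int)    # number of returns averaged into each estimate


def update_value(state, G):
    """Fold one observed return G into the running average for state."""
    N[state] += 1
    V[state] += (G - V[state]) / N[state]


for G in [2.0, 4.0, 6.0]:
    update_value("s0", G)
print(V["s0"])  # 4.0, the mean of 2.0, 4.0 and 6.0
```

Keeping only a count and a current estimate per state avoids the growing list of returns that a batch average would need.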

### Monte Carlo Control Algorithm (On-policy):

1. Initialize $Q(s, a)$ arbitrarily and $\pi$ to be epsilon-greedy
2. For each episode:
    - Generate episode using policy $\pi$
    - For each state-action pair $(s, a)$ in episode:
        - Calculate return $G$ from $(s, a)$
        - Update $Q(s, a)$ using empirical average of returns
        - Update policy $\pi$ to be epsilon-greedy with respect to $Q$ at state $s$

## Example Implementation of Monte Carlo Methods in Python

Below is an example implementation of Monte Carlo prediction and control methods in Python.


In [1]:
import numpy as np
from collections import defaultdict

# Define the environment and policy
states = [0, 1, 2, 3, 4]
actions = ['a', 'b']
policy = {s: np.random.choice(actions) for s in states}

# Simulate an environment
def generate_episode(policy):
    episode = []
    state = np.random.choice(states)
    while state != 4:  # Terminal state
        action = policy[state]
        next_state = np.random.choice(states)
        reward = np.random.randn()  # Random reward
        episode.append((state, action, reward))
        state = next_state
    return episode

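# Monte Carlo prediction (First-Visit)
def monte_carlo_prediction_first_visit(policy, episodes, gamma=0.9):
    V = defaultdict(float)
    returns = defaultdict(list)
    for _ in range(episodes):
        episode = generate_episode(policy)
        # Record the time step at which each state first occurs in this episode
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        G = 0
        # Walk backwards so G accumulates the discounted return from step t
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:  # only the first visit contributes
                returns[state].append(G)
                V[state] = np.mean(returns[state])
    return V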

# Run Monte Carlo prediction (First-Visit)
value_function = monte_carlo_prediction_first_visit(policy, episodes=1000)

print("Estimated State-Value Function:")
for state, value in value_function.items():
    print(f"V({state}) = {value:.2f}")


Estimated State-Value Function:
V(3) = -0.03
V(1) = 0.11
V(2) = 0.11
V(0) = 0.01


In [2]:
# Monte Carlo control (on-policy, First-Visit)
def monte_carlo_control_on_policy(episodes, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(len(actions)))
    N = defaultdict(lambda: np.zeros(len(actions)))  # first-visit counts per (state, action)
    policy = {s: np.random.choice(actions) for s in states}

    def epsilon_greedy_policy(state):
        if np.random.rand() < epsilon:
            return np.random.choice(actions)
        else:
            return actions[np.argmax(Q[state])]

    for _ in range(episodes):
        episode = []
        state = np.random.choice(states)
        while state != 4:  # Terminal state
            action = epsilon_greedy_policy(state)
            next_state = np.random.choice(states)
            reward = np.random.randn()  # Random reward
            episode.append((state, action, reward))
            state = next_state

        G = 0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            a = actions.index(action)
            G = gamma * G + reward
            # First-visit check: skip if (state, action) occurred earlier in the episode
            if (state, action) not in [(x[0], x[1]) for x in episode[:t]]:
                N[state][a] += 1
                Q[state][a] += (G - Q[state][a]) / N[state][a]  # running average of returns
                policy[state] = actions[np.argmax(Q[state])]

    return policy, Q

# Run Monte Carlo control (on-policy, First-Visit)
optimal_policy, action_value_function = monte_carlo_control_on_policy(episodes=1000)

print("\nOptimal Policy:")
for state, action in optimal_policy.items():
    print(f"State {state}: {action}")

print("\nEstimated Action-Value Function:")
for state, values in action_value_function.items():
    for action, value in zip(actions, values):
        print(f"Q({state}, {action}) = {value:.2f}")



Optimal Policy:
State 0: a
State 1: b
State 2: b
State 3: a
State 4: b

Estimated Action-Value Function:
Q(2, a) = -0.50
Q(2, b) = -0.06
Q(0, a) = 0.12
Q(0, b) = -1.13
Q(1, a) = -2.84
Q(1, b) = 0.33
Q(3, a) = -0.33
Q(3, b) = -0.80


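The estimated state-value function gives the expected return for each state under the given policy, and the learned policy lists the action with the highest estimated action value in each state. In this toy environment the next state and the reward are sampled independently of the chosen action, and the reward has zero mean, so the true value of every state and action is zero. The differences between the estimates above therefore reflect sampling noise rather than a genuinely better action.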