### Value Iteration

Value Iteration is a dynamic programming algorithm used to solve Markov Decision Processes (MDPs). It computes the optimal policy and the value function for an agent interacting with an environment. The algorithm iteratively updates the value of each state until it converges to the optimal value function.

#### Key Concepts:
1. **Value Function (V)**: The maximum expected cumulative discounted reward (the return) an agent can obtain when starting from a given state and following the optimal policy.
2. **Bellman Equation**: The core of value iteration. The Bellman optimality equation expresses the value of a state in terms of the values of its successor states.
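
As a worked illustration of a single Bellman backup, consider the following made-up numbers (the action-value symbol $Q$ is notation introduced only for this example). Assume $\gamma = 0.9$ and a state $s$ with two actions. Action $a_1$ moves deterministically to $s_1$ with reward $1$, where $V(s_1) = 2$. Action $a_2$ yields reward $0$ and moves to $s_2$ or $s_3$ with probability $0.5$ each, where $V(s_2) = 10$ and $V(s_3) = 0$. Then:

$$
Q(s, a_1) = 1 + 0.9 \cdot 2 = 2.8, \qquad
Q(s, a_2) = 0.5\,(0 + 0.9 \cdot 10) + 0.5\,(0 + 0.9 \cdot 0) = 4.5
$$

$$
V(s) = \max\{Q(s, a_1),\, Q(s, a_2)\} = 4.5
$$

In this example the riskier action $a_2$ wins because its expected discounted future value outweighs the immediate reward of $a_1$.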

#### Algorithm Steps:
1. **Initialization**: Start with an arbitrary value function $V(s)$ for all states $s$ (e.g., $V(s) = 0$).
2. **Update Rule**: For each state $s$, compute the next estimate $V_{k+1}(s)$ from the current estimate $V_k$ using:
    $$
    V_{k+1}(s) = \max_a \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]
    $$
   where:
    - $P(s' | s, a)$: Transition probability from state $s$ to $s'$ under action $a$.
    - $R(s, a, s')$: Reward received when transitioning from $s$ to $s'$ under action $a$.
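    - $\gamma$: Discount factor ($0 \leq \gamma < 1$) that weights immediate rewards more heavily than future rewards. The code below uses $\gamma = 1$ (no discounting), which lies outside this range; that can still work for episodic tasks such as FrozenLake, but the contraction-based convergence guarantee no longer applies.
3. **Convergence**: Repeat the update rule until the value function converges (i.e., the largest change in $V(s)$ across all states between two successive sweeps is smaller than a predefined threshold).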
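
The three steps above can be sketched on a tiny hand-made MDP. Everything in this example is an assumption made for illustration: the two-state, two-action MDP, the dense arrays `P[s, a, s']` and `R[s, a, s']`, and the function name `tiny_value_iteration`. Action 0 stays in the current state, action 1 switches to the other state, and landing in state 1 pays a reward of 1.

```python
import numpy as np

def tiny_value_iteration(P, R, gamma=0.5, eps=1e-8, max_iterations=10_000):
    """Synchronous value iteration on dense arrays P[s, a, s'] and R[s, a, s']."""
    V = np.zeros(P.shape[0])                      # Step 1: initialise V(s) = 0
    for _ in range(max_iterations):
        # Step 2: Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)
        # Step 3: stop once the largest per-state change falls below eps
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new
    return V

# Hypothetical MDP: action 0 = stay, action 1 = switch state
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # switch
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                # reward 1 whenever the next state is 1

V = tiny_value_iteration(P, R)
print(V)
```

With $\gamma = 0.5$ the fixed point can be checked by hand: staying in state 1 forever is worth $1 / (1 - 0.5) = 2$, and switching from state 0 is worth $1 + 0.5 \cdot 2 = 2$, so both values approach 2.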

#### Optimal Policy:
Once the value function converges, the optimal policy $\pi^*(s)$ can be derived as:
$$
\pi^*(s) = \arg\max_a \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma V(s') \right]
$$
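
A minimal sketch of this greedy extraction step, assuming the same kind of hypothetical two-state MDP (action 0 stays, action 1 switches, landing in state 1 pays 1) and a value function that has already converged to `[2.0, 2.0]` for $\gamma = 0.5$. The array layout and the name `greedy_policy` are illustrative choices.

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Return argmax_a sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s']) per state."""
    Q = (P * (R + gamma * V)).sum(axis=2)
    # np.argmax returns the first maximal index, so ties resolve to the lowest action id
    return Q.argmax(axis=1)

# Hypothetical MDP: action 0 = stay, action 1 = switch state
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

V_star = np.array([2.0, 2.0])   # assumed converged values for gamma = 0.5
pi = greedy_policy(P, R, V_star, gamma=0.5)
print(pi)
```

Here state 0 prefers to switch (action 1) and state 1 prefers to stay (action 0). The tie-breaking comment matters in practice: states where every action has the same value are all assigned action 0.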

#### Advantages:
- For finite MDPs with $\gamma < 1$, the updates converge to the optimal value function, and the policy that acts greedily with respect to it is optimal.
- Simple and effective for small state and action spaces.

#### Limitations:
- Computationally expensive for large state or action spaces due to the need to evaluate all states and actions in each iteration.
- Requires knowledge of the transition probabilities and rewards, which may not always be available.

Value Iteration is widely used in reinforcement learning and decision-making problems where the environment can be modeled as an MDP.

In [2]:
import numpy as np
import gym

In [3]:
def value_iteration(env, gamma = 1.0):
    # Initialize the value function for all states to zero
    v = np.zeros(env.observation_space.n)
    
    # Set the maximum number of iterations and the convergence threshold
    max_iterations = 200000
    eps = 1e-5
    
    # Iterate to update the value function
    for i in range(max_iterations):
        # Make a copy of the current value function to track changes
        v_ = np.copy(v)
        
        # Loop through each state in the environment
        for state in range(env.observation_space.n):
            # Compute the value of each action by summing over all possible transitions
            q = [sum([prob * (reward + gamma * v_[state_]) 
                      for prob, state_, reward, _ in env.unwrapped.P[state][action]])
                 for action in range(env.action_space.n)]
            
            # Update the value of the current state to the maximum value of all actions
            v[state] = max(q)
        
        # Check for convergence by comparing the change in value function
        if (np.sum(np.fabs(v_ - v)) <= eps):
            print('Value-iteration converged at iteration %d.' % (i + 1))
            break

    # Return the optimal value function
    return v
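
The inner comprehension above relies on the layout of the transition table, where `P[state][action]` is a list of `(probability, next_state, reward, terminated)` tuples. A minimal sketch of one backup over such a table follows. The table `toy_P` and its numbers are hypothetical and only mimic that layout; they are not FrozenLake's actual dynamics.

```python
# Hypothetical transition table in the same layout as the environment's P:
# toy_P[state][action] -> list of (probability, next_state, reward, terminated)
toy_P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                        # action 0: stay put
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],   # action 1: usually reach the goal
    },
    1: {
        0: [(1.0, 1, 0.0, True)],                         # state 1 is absorbing
        1: [(1.0, 1, 0.0, True)],
    },
}

gamma = 0.9
v = [0.0, 0.0]   # current value estimates, all zero as in the first sweep

# One backup for state 0: expected one-step reward plus discounted next value
q = [
    sum(prob * (reward + gamma * v[next_state])
        for prob, next_state, reward, _ in toy_P[0][action])
    for action in sorted(toy_P[0])
]
best_value = max(q)
print(q, best_value)
```

With all values initialised to zero, the backup reduces to the expected immediate reward, so only action 1 contributes a nonzero estimate on the first sweep.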


In [4]:
def policy_extraction(env, v, gamma = 1.0):
    # Initialize the policy with zeros for all states
    policy = np.zeros(env.observation_space.n)
    
    # Loop through each state in the environment
    for state in range(env.observation_space.n):
        # Initialize an array to store the value of each action
        q = np.zeros(env.action_space.n)
        
        # Loop through each action available in the current state
        for action in range(env.action_space.n):
            # Compute the value of the action by summing over all possible transitions
            for prob, state_, reward, _ in env.unwrapped.P[state][action]:
                q[action] += prob * (reward + gamma * v[state_])
        
        # Select the action with the highest value as the optimal action for the current state
        policy[state] = np.argmax(q)

    # Return the extracted policy
    return policy

In [8]:
def run_episode(env, policy, gamma = 1.0, render = False):
    # Reset the environment to the initial state
    observation = env.reset()
    observation = observation[0]  # Extract the initial observation
    total_reward = 0  # Initialize the total reward
    step_idx = 0  # Initialize the step counter

    while True:
        if render:
            env.render()  # Render the environment if specified
        # Take the action dictated by the policy for the current state
        observation, reward, done, truncated, _ = env.step(int(policy[observation]))
        # Accumulate the discounted reward
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1  # Increment the step counter
        # Only `done` (a terminal state) ends the episode; `truncated` (time limit) is ignored
        if done:
            break

    return total_reward  # Return the total reward for the episode

def test_episode(env, policy):
    # Reset the environment to the initial state
    observation = env.reset()
    observation = observation[0]  # Extract the initial observation

    while True:
        env.render()  # Render the environment
        # Take the action dictated by the policy for the current state
        observation, _, done, truncated, _ = env.step(int(policy[observation]))
        # Only `done` (a terminal state) ends the episode; `truncated` (time limit) is ignored
        if done:
            break

def evaluate_policy(env, policy, gamma = 1.0, n = 100):
    # Run multiple episodes and calculate the average score
    scores = [run_episode(env, policy, gamma, False) for _ in range(n)]
    return np.mean(scores)  # Return the mean score across episodes


if __name__ == '__main__':
    env_name = 'FrozenLake8x8-v1' 
    env = gym.make(env_name)
    optimal_v = value_iteration(env, gamma = 1.0)
    policy = policy_extraction(env, optimal_v, gamma = 1.0)
    score = evaluate_policy(env, policy, gamma = 1.0)
    print('Average scores = ', score)  # evaluate_policy already returns the mean
    test_episode(env, policy)

Value-iteration converged at iteration 813.
Average scores =  1.0




In [6]:
env = gym.make(env_name, render_mode='human')
test_episode(env, policy)


In [9]:
optimal_v.reshape(8, 8)

array([[0.99998899, 0.99998947, 0.99999016, 0.99999092, 0.99999168,
        0.99999242, 0.99999307, 0.99999355],
       [0.99998887, 0.99998923, 0.99998983, 0.99999055, 0.99999132,
        0.99999209, 0.99999289, 0.99999383],
       [0.99997859, 0.97818182, 0.92641358, 0.        , 0.85660943,
        0.94622381, 0.98207026, 0.99999438],
       [0.99996925, 0.93457851, 0.80106982, 0.47489431, 0.62361351,
        0.        , 0.94467201, 0.99999518],
       [0.99996124, 0.8255856 , 0.54221824, 0.        , 0.53933714,
        0.61118455, 0.85195085, 0.99999619],
       [0.99995491, 0.        , 0.        , 0.16803797, 0.3832136 ,
        0.44226587, 0.        , 0.99999736],
       [0.99995053, 0.        , 0.19466348, 0.12090042, 0.        ,
        0.33239961, 0.        , 0.99999865],
       [0.99994829, 0.73151853, 0.46309046, 0.        , 0.27746651,
        0.55493304, 0.77746651, 0.        ]])

In [10]:
policy.reshape(8, 8)

array([[3., 2., 2., 2., 2., 2., 2., 2.],
       [3., 3., 3., 3., 3., 3., 3., 2.],
       [0., 0., 0., 0., 2., 3., 3., 2.],
       [0., 0., 0., 1., 0., 0., 2., 2.],
       [0., 3., 0., 0., 2., 1., 3., 2.],
       [0., 0., 0., 1., 3., 0., 0., 2.],
       [0., 0., 1., 0., 0., 0., 0., 2.],
       [0., 1., 0., 0., 1., 2., 1., 0.]])
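
The numeric policy grid is easier to read as arrows. The sketch below assumes FrozenLake's documented action encoding (0 = left, 1 = down, 2 = right, 3 = up). The helper `policy_to_arrows` is a hypothetical name introduced here, and the sample input is copied from the first two rows of the output above.

```python
import numpy as np

# FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up
ACTION_ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}

def policy_to_arrows(policy, shape=(8, 8)):
    """Render a flat policy array as one arrow string per grid row."""
    grid = np.asarray(policy).astype(int).reshape(shape)
    return ["".join(ACTION_ARROWS[int(a)] for a in row) for row in grid]

# First two rows of the policy printed above
sample = [3, 2, 2, 2, 2, 2, 2, 2,
          3, 3, 3, 3, 3, 3, 3, 2]
rows = policy_to_arrows(sample, shape=(2, 8))
print("\n".join(rows))
```

Entries for holes and the goal carry no meaning: all actions tie at value 0 in those states, and `np.argmax` resolves the tie to action 0, which shows up as a left arrow.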