# [Sutton and Barto Notebooks](https://github.com/seungjaeryanlee/sutton-barto-notebooks): Figure 4.2

[ModuAI](https://www.modu.ai)  
Author: Seung Jae (Ryan) Lee  

![Figure 4.2](figure_4_2.png)

In [1]:
from enum import IntEnum

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Environment

![Example 4.2](example_4_2.png)

In [2]:
class Environment:
    """
    The Jack's Car Rental environment described in Example 4.2.
    """
    state_space = [(i, j) for i in range(21) for j in range(21)]
    action_space = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]

    def is_valid(self, state, action):
        return (0 <= state[0] - action <= 20 and 0 <= state[1] + action <= 20)

    def peek(self, state, action):
        """
        Samples the outcome of taking the given action in the given state.
        The result consists of the next state and the reward.
        """
        assert self.is_valid(state, action)

        state = (state[0] - action, state[1] + action)
        reward = -2 * abs(action)

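        # Sample rental requests and returns for both locations (Poisson means from Example 4.2).
        requests = [np.random.poisson(lam=3), np.random.poisson(lam=4)]
        returns = [np.random.poisson(lam=3), np.random.poisson(lam=2)]

        # Only cars that are actually on the lot can be rented out.
        rentals = [min(state[0], requests[0]), min(state[1], requests[1])]
        reward += 10 * (rentals[0] + rentals[1])

        # Each location holds at most 20 cars; surplus returns leave the problem.
        state = (min(state[0] - rentals[0] + returns[0], 20),
                 min(state[1] - rentals[1] + returns[1], 20))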

        return state, reward
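
`peek` samples a single day, whereas the policy evaluation step in the pseudocode below works with expectations over requests and returns. The following is a minimal sketch of how the expected rental income at one location could be computed from the Poisson probabilities. The names `poisson_pmf` and `expected_rental_reward` and the cutoff `max_requests` are assumptions made for illustration; they are not part of the notebook.

```python
import math


def poisson_pmf(n, lam):
    """Probability of exactly n events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def expected_rental_reward(cars, lam, price=10, max_requests=30):
    """
    Expected rental income at one location that holds `cars` cars.

    max_requests is a hypothetical cutoff for the infinite Poisson sum;
    the probability mass beyond it is negligible for the means used here.
    """
    return sum(
        poisson_pmf(n, lam) * price * min(cars, n)  # only `cars` cars can be rented
        for n in range(max_requests + 1)
    )


# Expected income at the first location (mean of 3 requests) with 2 and with 20 cars.
few_cars = expected_rental_reward(2, lam=3)
many_cars = expected_rental_reward(20, lam=3)
```

With plenty of cars on the lot the expected income approaches 10 times the mean number of requests, and with few cars it is limited by the stock.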

## Agent

![Policy Iteration Pseudocode](policy_iteration.png)

In [3]:
class PolicyIterationAgent:
    """
    A policy iteration agent.
    """

    def __init__(self, env):
        self.env = env
        # One value estimate and one action (net cars moved) per state.
        self.state_values = np.zeros(len(env.state_space))
        self.policy = np.zeros(len(env.state_space), dtype=int)

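    def policy_evaluation(self):
        """Update state values for the current policy (not implemented yet)."""
        pass

    def policy_improvement(self):
        """Make the policy greedy w.r.t. the state values (not implemented yet)."""
        pass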

    def train(self):
        """
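        Run one round of policy evaluation followed by policy improvement.
        Full policy iteration repeats both steps until the policy is stable.
        """
        self.policy_evaluation()
        self.policy_improvement()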
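
The two steps in the pseudocode can be sketched for a generic tabular MDP as follows. This is not the notebook's implementation: it assumes the dynamics are given as arrays `P` and `R` (hypothetical names), uses synchronous sweeps rather than the in-place updates of the pseudocode, and keeps the current action on ties so that the loop cannot oscillate between equally good policies.

```python
import numpy as np


def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """
    P[a, s, s2]: probability of moving from state s to s2 under action a.
    R[a, s]: expected immediate reward for taking action a in state s.
    Returns the final policy and its state values.
    """
    n_actions, n_states, _ = P.shape
    values = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    states = np.arange(n_states)

    while True:
        # Policy evaluation: sweep until the largest value change is below theta.
        while True:
            new_values = R[policy, states] + gamma * P[policy, states] @ values
            delta = np.max(np.abs(new_values - values))
            values = new_values
            if delta < theta:
                break

        # Policy improvement: act greedily with respect to the current values,
        # but keep the current action when it is (almost) as good as the best one.
        q = R + gamma * P @ values
        keep = q[policy, states] >= q.max(axis=0) - 1e-9
        new_policy = np.where(keep, policy, np.argmax(q, axis=0))

        if np.array_equal(new_policy, policy):
            return policy, values
        policy = new_policy


# Toy MDP (assumption for illustration): action 0 stays, action 1 switches state.
# Staying in state 1 pays 1; everything else pays 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
policy, values = policy_iteration(P, R)
```

In this toy problem the resulting policy switches out of state 0 and stays in state 1. For Jack's Car Rental, `P` and `R` would have to be built from the Poisson request and return probabilities.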

## Plots