### Cart Pole With an Explicit Policy

The explicit policy used in this cart pole problem is as follows.

In each state, first compare the signs of the pole angle and the pole velocity. If both are negative, push the cart to the left; if both are positive, push the cart to the right. Otherwise, look at the cart position and the cart velocity. If their product is positive (the cart is moving in the direction of its position, that is, away from the center), push the cart in the direction opposite to its position; otherwise, push the cart in the direction opposite to its velocity.
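The rule above can also be written as a small standalone function, which makes it easy to check individual states without running the environment. This is only a sketch: the names `explicit_policy`, `LEFT`, and `RIGHT` are hypothetical and not part of the notebook, and the behavior when the cart velocity is exactly zero is a tie-breaking assumption that may differ from the notebook cell further down.

```python
LEFT, RIGHT = 0, 1  # CartPole action encoding: 0 pushes left, 1 pushes right


def explicit_policy(observation):
    """Return an action for one observation (hypothetical helper, not from the notebook)."""
    cart_position, cart_velocity, pole_angle, pole_velocity = observation

    # Pole is leaning and still moving the same way: push the cart under it.
    if pole_angle < 0 and pole_velocity < 0:
        return LEFT
    if pole_angle > 0 and pole_velocity > 0:
        return RIGHT

    # Pole is recovering: use the cart state instead.
    if cart_position * cart_velocity > 0:
        # Cart is drifting away from the center: push against its position.
        return RIGHT if cart_position < 0 else LEFT

    # Otherwise push against the cart velocity.
    # Assumption: a velocity of exactly zero is treated as "push right".
    return LEFT if cart_velocity > 0 else RIGHT


# Example: pole leaning right and falling right, so the cart is pushed right.
print(explicit_policy([0.0, 0.0, 0.05, 0.1]))
```

Keeping the decision in a pure function like this separates the policy from the environment loop, so the branches can be tested one state at a time.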

In [1]:
import gym

In [2]:
# get cart pole environment
env = gym.make("CartPole-v1")
# goal: keep the pole from falling over and the cart in bounds (-2.4 < cart_position < 2.4 and -12 degrees < pole_angle < 12 degrees) for 200 time steps
# action: 0 (push cart towards left), 1 (push cart towards right)
# observation (state): [cart_position, cart_velocity, pole_angle, pole_velocity_at_tip]

In [3]:
def run_episode():
    total_reward = 0
    time_step = 0
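    # classic Gym API (gym < 0.26): reset() returns only the observation and step() returns four values
    observation = env.reset()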
    for time_step in range(200):
        env.render()
        # get cart position, cart velocity, pole angle, and pole velocity at tip from observation (state)
        cart_position, cart_velocity, pole_angle, pole_velocity_at_tip = observation
        # apply an explicit policy
        if pole_angle < 0 and pole_velocity_at_tip < 0:
            action = 0
        elif pole_angle > 0 and pole_velocity_at_tip > 0:
            action = 1
        else:
            if cart_position < 0:
                if cart_velocity < 0:
                    action = 1
                else:
                    action = 0
            else:
                if cart_velocity > 0:
                    action = 0
                else:
                    action = 1
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward, time_step + 1

In [4]:
# test episode runs
all_reward = 0
episodes = 20
for episode in range(episodes):
    total_reward, time_step = run_episode()
    all_reward += total_reward
    print(f"Episode:{episode:2d}, Reward:{total_reward:3.2f}, Timestep:{time_step:3d}")
print(f"Over {episodes} episodes, average reward: {all_reward / episodes}")

Episode: 0, Reward:150.00, Timestep:150
Episode: 1, Reward:200.00, Timestep:200
Episode: 2, Reward:200.00, Timestep:200
Episode: 3, Reward:200.00, Timestep:200
Episode: 4, Reward:169.00, Timestep:169
Episode: 5, Reward:200.00, Timestep:200
Episode: 6, Reward:149.00, Timestep:149
Episode: 7, Reward:154.00, Timestep:154
Episode: 8, Reward:200.00, Timestep:200
Episode: 9, Reward:200.00, Timestep:200
Episode:10, Reward:200.00, Timestep:200
Episode:11, Reward:200.00, Timestep:200
Episode:12, Reward:200.00, Timestep:200
Episode:13, Reward:200.00, Timestep:200
Episode:14, Reward:200.00, Timestep:200
Episode:15, Reward:200.00, Timestep:200
Episode:16, Reward:162.00, Timestep:162
Episode:17, Reward:200.00, Timestep:200
Episode:18, Reward:200.00, Timestep:200
Episode:19, Reward:200.00, Timestep:200
Over 20 episodes, average reward: 189.2
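To read the run above at a glance, the printed rewards can be summarized with a few lines of standard-library Python. This is a sketch only: the `rewards` list is copied by hand from the output above, and the names `rewards`, `step_cap`, and `full_runs` are hypothetical.

```python
from statistics import mean

# Rewards copied from the printed output above (episodes 0 to 19).
rewards = [150, 200, 200, 200, 169, 200, 149, 154, 200, 200,
           200, 200, 200, 200, 200, 200, 162, 200, 200, 200]
step_cap = 200  # maximum number of time steps per episode in run_episode()

# Count the episodes in which the pole stayed up for the full 200 steps.
full_runs = sum(1 for r in rewards if r >= step_cap)

print(f"Episodes reaching the {step_cap}-step cap: {full_runs} of {len(rewards)}")
print(f"Average reward: {mean(rewards)}")
print(f"Shortest episode: {min(rewards)} steps")
```

Because the starting state is randomized, these figures describe this particular run and will vary between runs.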


In [5]:
# close cart pole environment
env.close()