In [1]:
import gym
import numpy as np

env = gym.make('CartPole-v1')

Let's see what the possible actions are.

In [2]:
print(env.env.__doc__)


    Description:
        A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

    Source:
        This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson

    Observation: 
        Type: Box(4)
        Num	Observation                 Min         Max
        0	Cart Position             -4.8            4.8
        1	Cart Velocity             -Inf            Inf
        2	Pole Angle                 -24°           24°
        3	Pole Velocity At Tip      -Inf            Inf
        
    Actions:
        Type: Discrete(2)
        Num	Action
        0	Push cart to the left
        1	Push cart to the right
        
        Note: The amount the velocity is reduced or increased is not fixed as it depends on the angle the pole is pointing. This is because the center of gravity o

In [3]:
env.action_space

LEFT = 0
RIGHT = 1

It looks like we have two moves, left and right. In this environment, that means pushing the cart with a fixed-magnitude force in the negative or positive direction, respectively.

In [58]:
def reverse(direction):
    if direction == LEFT:
        return RIGHT
    return LEFT

LOG = False

def log(*s):
    # Print the arguments only when LOG is enabled
    if not LOG:
        return
    print(*s)

## Policy

My policy is to keep the pole's velocity (and angle) close to zero. If the pole starts rotating or leaning too far in one direction, the agent pushes the cart in that same direction to reverse the motion.

The thresholds sit slightly above and below 0 (at ±0.02) and depend on the previous action: the agent keeps repeating its last action until the pole's velocity or angle crosses the threshold on the opposite side, and only then switches. This avoids sharp changes of direction that would make the pole fall over quickly.
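
As a minimal sketch, the decision rule described above can be isolated as a pure function that needs no environment. The names `choose_action` and `threshold` are introduced here for illustration only; the default of 0.02 mirrors the value used in the episode loop.

```python
LEFT = 0
RIGHT = 1

def choose_action(pole_ang, pole_vel, old_action, threshold=0.02):
    """Pick the next action from the pole state and the previous action.

    Hypothetical helper: keeps repeating the previous action until the
    pole's angle or velocity crosses the threshold on the opposite side.
    """
    # Keep pushing left while the pole is not leaning/rotating right too much
    if old_action == LEFT and pole_vel < threshold and pole_ang < threshold:
        return LEFT
    # Keep pushing right while the pole is not leaning/rotating left too much
    if old_action == RIGHT and pole_vel > -threshold and pole_ang > -threshold:
        return RIGHT
    # Otherwise switch direction
    return RIGHT if old_action == LEFT else LEFT
```

For example, `choose_action(0.0, 0.0, LEFT)` keeps pushing left, while `choose_action(0.05, 0.0, LEFT)` switches to `RIGHT` because the angle has passed +0.02.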

In [61]:
def run_episode(step_count=300):
    obs = env.reset()
    
    cart_pos = obs[0]
    cart_vel = obs[1]
    pole_ang = obs[2]
    pole_vel = obs[3]
    
    score = 0
    action = LEFT
    
    for t in range(step_count):
        env.render()
        old_action = action
        
        # Action selection
        if pole_vel < 0.02 and pole_ang < 0.02 and old_action == LEFT:
            action = LEFT
        elif pole_vel > -0.02 and pole_ang > -0.02 and old_action == RIGHT:
            action = RIGHT
        else:
            action = reverse(old_action)
        
        # Take action and add reward
        obs, r, done, _ = env.step(action)
        score += r
        
        cart_pos = obs[0]
        cart_vel = obs[1]
        pole_ang = obs[2]
        pole_vel = obs[3]
        
        log(t, obs)
        log(pole_ang, pole_vel)
        
        if done:
            log("Done after {} steps".format(t+1))
            break
    log("Score: {}".format(score))
    return score

In [60]:
try:
    step_count = 2000
    total_score = 0.0
    n_episodes = 20
    for ep_n in range(n_episodes):
        score = run_episode(step_count)
        print("Ep {} Score: {}".format(ep_n + 1, score))
        total_score += score
    avg_score = total_score / n_episodes
    print("Average Score in {} episodes: {}".format(n_episodes, avg_score))
finally:
    env.close()

Ep 1 Score: 500.0
Ep 2 Score: 260.0
Ep 3 Score: 500.0
Ep 4 Score: 500.0
Ep 5 Score: 415.0
Ep 6 Score: 500.0
Ep 7 Score: 170.0
Ep 8 Score: 500.0
Ep 9 Score: 500.0
Ep 10 Score: 500.0
Ep 11 Score: 500.0
Ep 12 Score: 500.0
Ep 13 Score: 500.0
Ep 14 Score: 500.0
Ep 15 Score: 500.0
Ep 16 Score: 500.0
Ep 17 Score: 500.0
Ep 18 Score: 186.0
Ep 19 Score: 245.0
Ep 20 Score: 500.0
Average Score in 20 episodes: 438.8
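
The scores printed above can be summarized a little further. The sketch below simply copies the 20 values from that output into a list; `MAX_SCORE` is a name introduced here, set to 500 because that is the highest score appearing in the run.

```python
# Scores copied from the output above (episodes 1 to 20)
scores = [500.0, 260.0, 500.0, 500.0, 415.0, 500.0, 170.0, 500.0, 500.0, 500.0,
          500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 186.0, 245.0, 500.0]

MAX_SCORE = 500.0  # assumption: the highest score observed in this run

avg_score = sum(scores) / len(scores)
n_max = sum(1 for s in scores if s == MAX_SCORE)
worst = min(scores)

print("Average: {}".format(avg_score))
print("Episodes reaching {}: {} of {}".format(MAX_SCORE, n_max, len(scores)))
print("Worst episode: {}".format(worst))
```

The average matches the 438.8 reported by the notebook; 15 of the 20 episodes reach 500, and the worst episode scores 170.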
