# CSCI 3202, Fall 2020

# Wednesday November 25, 2020

# In-class notebook:  Gym and Active Learning

<a id='top'></a>

<br>


Shortcuts:  [Top](#top) || [Frozen Lake](#lake) || [Cart-Pole](#pole)

<br>

---

## Overview

#### Last time, we worked through an in-class notebook describing a path-finding agent.  It used a variant of Q-learning that was nearly the same as the temporal-difference model, because movement had no stochastic (random) component.

There were a few takeaways from that model.

- It didn't have to *learn* exploration, since it had a fully observable state space or maze.
- It chose which action to take at random, rather than prioritizing exploitation after sufficient iterations.
- It created a (sparse) matrix of (state, state) pairs as a proxy for a true set of (state, action) pairs.  This is a bad habit for two reasons: it wastes memory, and it doesn't generalize to many problems.  Often our actions are things like "press the accelerator," which is a little different from "try to move to known state #3459."
 

## The `gym` Environment

The Python package `gym` is a nice one that was designed with reinforcement learning methods like Q-learning in mind, and it contains a large set of example problems.  You may find its documentation at https://gym.openai.com/, and it can be installed via pip (pip3) or directly in conda/anaconda.  It is loaded below along with numpy and matplotlib.

Rendering in Jupyter is sketchy, so I'll be running any cells that animate directly as Python .py scripts in IDLE.

In [2]:
import sys
sys.path.append('c:/users/zacha/appdata/local/packages/pythonsoftwarefoundation.python.3.7_qbz5n2kfra8p0/localcache/local-packages/python37/site-packages')
import gym
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# Problem 1: Frozen Lake

This is a path-finding problem.

In [134]:
env = gym.make('FrozenLake-v0')
env.render()
print(env.action_space)
print(env.observation_space)


[S]FFF
FHFH
FFFH
HFFG
Discrete(4)
Discrete(16)


We are navigating a 4x4 grid (its 16 tiles are the states in `env.observation_space`).  The tiles are denoted as:
- S: Start
- G: Goal
- F: Frozen Ice
- H: A hole in the ice.  We fall through and get very cold.

At any state, we can choose one of 4 actions (listed in `env.action_space`):
- LEFT = 0
- DOWN = 1
- RIGHT = 2
- UP = 3

For each tile on the grid above, every action has a list of possible outcomes, each with its own probability.

`env.P[state][action]` returns a list of tuples, one per possible outcome, each holding:

- the probability of a successor
- the actual successor
- the reward of that successor
- whether or not we're "done" with the experiment

In [137]:
print(env.P[0][0])
# print(env.P[1][2])
# print(env.P[14][2])

[(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False)]
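As a quick sanity check on how to read that output, here is a minimal sketch that works directly on the printed list (copied by hand, so it runs without `gym`). The variable names are only illustrative. It checks that the probabilities sum to 1 and merges duplicate successors: taking LEFT from state 0 appears to leave us in state 0 two thirds of the time and slide us down to state 4 one third of the time.

```python
# Transition list copied from the env.P[0][0] output above.
# Each tuple is (probability, next_state, reward, done).
transitions = [
    (0.3333333333333333, 0, 0.0, False),
    (0.3333333333333333, 0, 0.0, False),
    (0.3333333333333333, 4, 0.0, False),
]

# The probabilities over all outcomes should sum to 1.
total_prob = sum(prob for prob, _, _, _ in transitions)

# Merge outcomes that land on the same successor state.
next_state_probs = {}
for prob, next_state, reward, done in transitions:
    next_state_probs[next_state] = next_state_probs.get(next_state, 0.0) + prob

# Expected immediate reward of this (state, action) pair.
expected_reward = sum(prob * reward for prob, _, reward, _ in transitions)

print(total_prob)
print(next_state_probs)
print(expected_reward)
```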


What `gym` does nicely for us is that it both stores the transition information above and lets us take a `step()` of the random process.  This is the same as our `transition` on the homework, and describes how the random process evolves.  It also includes all of the crucial measurements we want.  Per the official documentation, `step` returns four values. These are:



1. `observation` (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
2. `reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
3. `done` (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
4. `info` (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
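To make the four return values concrete without needing `gym` installed, here is a small sketch of the usual episode loop. `CoinFlipEnv` is a made-up stand-in class (it is not part of `gym`); it only mimics the `reset()` / `step()` shape described above.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment (NOT part of gym) with a gym-like interface."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # initial observation

    def step(self, action):
        self.steps += 1
        observation = self.rng.randint(0, 1)            # the coin flip we observe
        reward = 1.0 if action == observation else 0.0  # reward for guessing it
        done = self.steps >= self.max_steps             # episode ends after max_steps
        info = {"steps": self.steps}                    # diagnostics only
        return observation, reward, done, info


env_demo = CoinFlipEnv()
observation = env_demo.reset()
done = False
total_reward = 0.0

while not done:
    action = 0  # always guess 0; a real agent would choose from its Q-values
    observation, reward, done, info = env_demo.step(action)
    total_reward += reward

print(total_reward, info)
```

The loop has the same shape as the training loop used for Frozen Lake: reset, then step until `done` is `True`.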

## Making Q:
In general, Q needs to be a thing that can be evaluated for each state-action pair.  

- Since each state draws from only a small set of possible actions, one option is to create a *rectangular* array where each row is a state and each column holds the Q-value for one action.  We would still have to be careful never to choose an invalid action, however, since the array includes possibilities like moving "Up" when we're already in the top row.
- Another option is to create a full dictionary or nested dictionaries, where the first key is the state, the second key is the action, and the resulting list or array holds the Q-values, number of times taken, and any other needed information about that action.

I will do the first here, but if the number of possible actions is *huge* and you can only take a few of those actions in any *given* state, a rectangular array would be a very memory-inefficient allocation and we'd want dictionaries!
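For comparison, the dictionary option might look something like the sketch below. The helper names (`get_entry`, `best_action`) and the `[q_value, visit_count]` layout are assumptions made for illustration, not something the rest of the notebook relies on.

```python
from collections import defaultdict

# Q[state][action] -> [q_value, visit_count]; entries are created only when touched.
Q = defaultdict(dict)


def get_entry(Q, state, action):
    """Return the [q_value, visit_count] list for a pair, creating it if needed."""
    return Q[state].setdefault(action, [0.0, 0])


def best_action(Q, state, valid_actions):
    """Greedy choice restricted to the actions that are valid in this state."""
    return max(valid_actions, key=lambda a: get_entry(Q, state, a)[0])


# States and actions can be any hashable objects, e.g. grid coordinates and names.
entry = get_entry(Q, state=(0, 0), action="RIGHT")
entry[0] = 0.5   # pretend we learned a Q-value
entry[1] += 1    # and record one visit

print(best_action(Q, (0, 0), ["DOWN", "RIGHT"]))
```

Only the pairs we actually visit take up memory, and invalid actions simply never appear as keys.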

In [142]:
#https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/
alpha = 0.2
gamma = 0.2
epsilon = 0.2
env = gym.make('FrozenLake-v0')
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# states=np.array([])
for i in range(1, 100001):
    state = env.reset()
    done = False
    
    while not done:
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # EXPLORE action space at prob epsilon
        else:
            action = np.argmax(q_table[state]) # else EXPLOIT learned values

        next_state, reward, done, info = env.step(action) 
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max) #Q update formula, discounted
        q_table[state, action] = new_value

        state = next_state
        
    if i % 500 == 0:
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 500
Episode: 1000
Episode: 1500
Episode: 2000
Episode: 2500
Episode: 3000
Episode: 3500
Episode: 4000
Episode: 4500
Episode: 5000
Episode: 5500
Episode: 6000
Episode: 6500
Episode: 7000
Episode: 7500
Episode: 8000
Episode: 8500
Episode: 9000
Episode: 9500
Episode: 10000
Episode: 10500
Episode: 11000
Episode: 11500
Episode: 12000
Episode: 12500
Episode: 13000
Episode: 13500
Episode: 14000
Episode: 14500
Episode: 15000
Episode: 15500
Episode: 16000
Episode: 16500
Episode: 17000
Episode: 17500
Episode: 18000
Episode: 18500
Episode: 19000
Episode: 19500
Episode: 20000
Episode: 20500
Episode: 21000
Episode: 21500
Episode: 22000
Episode: 22500
Episode: 23000
Episode: 23500
Episode: 24000
Episode: 24500
Episode: 25000
Episode: 25500
Episode: 26000
Episode: 26500
Episode: 27000
Episode: 27500
Episode: 28000
Episode: 28500
Episode: 29000
Episode: 29500
Episode: 30000
Episode: 30500
Episode: 31000
Episode: 31500
Episode: 32000
Episode: 32500
Episode: 33000
Episode: 33500
Episode: 34000
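Written out, the update line inside the training loop is the usual Q-learning rule. The two worked steps below use the same `alpha = 0.2` and `gamma = 0.2` as the cell above and assume every Q-value starts at 0; they are meant as a rough illustration of why values far from the goal come out very small with such a low discount factor.

```latex
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r + \gamma \max_{a'} Q(s',a')\right]

% Stepping onto the goal: r = 1, and the goal row of the table is never updated, so \max Q = 0
Q(s,a) \leftarrow 0.8 \cdot 0 + 0.2\,(1 + 0.2 \cdot 0) = 0.2

% One step earlier: r = 0, and the best value available from the next state is 0.2
Q(s,a) \leftarrow 0.8 \cdot 0 + 0.2\,(0 + 0.2 \cdot 0.2) = 0.008
```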


In [143]:
print(q_table[0])
def policy_to_dir(policy):
    if policy==0: return 'L'
    elif policy==1: return 'D'
    elif policy==2: return 'R'
    elif policy==3: return 'U'
dirs=np.array([policy_to_dir(np.argmax(q_table[st])) for st in range(16)])

dirs.shape=(4,4)
print(dirs)
env.reset()
env.render()

[6.26006117e-07 6.47653984e-07 1.89301501e-06 5.56768076e-07]
[['R' 'D' 'R' 'L']
 ['D' 'L' 'R' 'L']
 ['R' 'D' 'D' 'L']
 ['L' 'D' 'D' 'L']]

[S]FFF
FHFH
FFFH
HFFG
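One thing to keep in mind when reading the policy grid: the rows of `q_table` for holes and for the goal are never updated (the episode ends there), so `argmax` returns action 0 and those tiles print as 'L'. A possible way to make the grid easier to read is to overlay the map on top of the policy, as in this sketch. The Q-values here are random placeholders, not the trained table.

```python
import numpy as np

# Map layout as printed by env.render() above.
lake_map = ["SFFF", "FHFH", "FFFH", "HFFG"]
tiles = np.array([list(row) for row in lake_map])

# Placeholder Q-table (16 states x 4 actions); random values for illustration only.
rng = np.random.default_rng(0)
q_demo = rng.random((16, 4))

# Index = action code: 0 LEFT, 1 DOWN, 2 RIGHT, 3 UP.
action_letters = np.array(["L", "D", "R", "U"])
policy_grid = action_letters[np.argmax(q_demo, axis=1)].reshape(4, 4)

# Show 'H' and 'G' on terminal tiles instead of a meaningless action.
terminal = (tiles == "H") | (tiles == "G")
policy_grid = np.where(terminal, tiles, policy_grid)

print(policy_grid)
```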


In [149]:
#Now that we have a Q, we can watch an agent go!
state = env.reset()
done = False

while not done:
    action = np.argmax(q_table[state]) # Exploit learned values
    next_state, reward, done, info = env.step(action) 
    state = next_state
    env.render()


  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Right)
SFFF
FH[41mF[0mH
FFFH
HFFG
  (Right)
SFFF
FHFH
FF[41mF[0mH
HFFG
  (Down)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Down)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m


# Problem 2: Cartpole: Dealing with Continuity

This is a balancing problem! We ran a quick demo, which can be run outside Jupyter if you wish to visualize it.



In [None]:
##CELL WILL NOT WORK IN JUPYTER

import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(200):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()



In order, the 4 components of the observation are:

| Observation | Min | Max |
| --- | --- | --- |
| Cart Position | -4.8 | 4.8 |
| Cart Velocity | -Inf | Inf |
| Pole Angle | -0.418 rad (-24 deg) | 0.418 rad (24 deg) |
| Pole Angular Velocity | -Inf | Inf |


And we can choose between 2 discrete moves:
- 0: push the cart to the left
- 1: push the cart to the right

In [150]:
env = gym.make('CartPole-v0')
print(env.observation_space)

print(env.observation_space.low)
print(env.observation_space.high)


Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


This is a continuous state space!  The tabular Q-learning we used above requires discrete state-action pairs, so one option is to write a function that bins observations into regions instead.  A binning function just chops the state space into regions like $7<x<8$ and assigns all $x$ values within that region to a single discrete value.
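One way to sketch such a binning function is with `np.linspace` for the region edges and `np.digitize` to look up the region. This is a different technique from rounding a proportion by hand; the bin count and the helper name `to_bin` are arbitrary choices for illustration. A side effect of using only the interior edges is that values outside the limits fall into the first or last bin.

```python
import numpy as np

n_bins = 10
low, high = -4.8, 4.8  # cart position limits from the table above

# n_bins equal-width regions need n_bins - 1 interior edges.
edges = np.linspace(low, high, n_bins + 1)[1:-1]


def to_bin(x):
    """Map a continuous value to an integer bin index in 0 .. n_bins - 1."""
    return int(np.digitize(x, edges))


# A value near the middle, the two limits, and a value far outside the range.
print(to_bin(-0.2), to_bin(-4.8), to_bin(4.8), to_bin(100.0))
```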

In [155]:
#consider binning on the first category...
bins=10000
obs=-.2
minv=env.observation_space.low[0]
maxv=env.observation_space.high[0]
prop=(obs-minv)/(maxv-minv) #numerator: distance from min value
                            #denominator: out of total length of continuous space
print(prop, ' proportion of the way through the space')
obs_bin=int(round((bins-1)*prop))
print(obs_bin, 'bin index')

0.4791666674945089  proportion of the way through the space
4791 bin index


In [156]:
#now maybe all 4 at once.  Let's cap velocities at 100.
env = gym.make('CartPole-v0')
env.reset()

#copy the bounds so that editing them does not overwrite the environment's own arrays
mins=env.observation_space.low.copy()
mins[1]=-100
mins[3]=-100
maxs=env.observation_space.high.copy()
maxs[1]=100
maxs[3]=100

def discretize_state(bins, state_min, state_max, obs):
    discretized=list()
    for i in range(len(obs)):
        prop=((obs[i]-state_min[i])/
                 (state_max[i]-state_min[i]))
        print(obs[i], bins[i], state_max[i], state_min[i])
        obs_bin=int(round((bins[i]-1)*prop))
        discretized.append(obs_bin)
    return tuple(discretized)

print(discretize_state((5,5,10,10), mins,maxs, [0,1,-.3,0]))

0 5 4.8 -4.8
1 5 100.0 -100.0
-0.3 10 0.41887903 -0.41887903
0 10 100.0 -100.0
(2, 2, 1, 4)
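Two things are worth watching with this approach. An observation outside the stated limits would produce an index outside `0 .. bins-1`, and with velocities capped at 100 and only 5 bins, almost any realistic velocity lands in the middle bin. The sketch below is one possible variant that clips observations first; the name `discretize_clipped` and the velocity caps of 3.0 are assumptions for illustration, not tuned values.

```python
import numpy as np


def discretize_clipped(bins, state_min, state_max, obs):
    """Like discretize_state, but clips obs into [state_min, state_max] first."""
    bins = np.asarray(bins)
    state_min = np.asarray(state_min, dtype=float)
    state_max = np.asarray(state_max, dtype=float)
    obs = np.clip(np.asarray(obs, dtype=float), state_min, state_max)
    prop = (obs - state_min) / (state_max - state_min)  # fraction of the way through
    idx = np.rint((bins - 1) * prop).astype(int)        # nearest of the bin indices
    return tuple(int(i) for i in idx)


demo_bins = (5, 5, 10, 10)
demo_min = [-4.8, -3.0, -0.419, -3.0]  # velocity caps of 3.0 are an assumption
demo_max = [4.8, 3.0, 0.419, 3.0]

inside = discretize_clipped(demo_bins, demo_min, demo_max, [0, 1, -0.3, 0])
outside = discretize_clipped(demo_bins, demo_min, demo_max, [9.0, -50.0, 0.0, 50.0])
print(inside)
print(outside)
```

With the tighter caps, a cart velocity of 1 no longer shares a bin with a velocity of 0, and the out-of-range observation still maps to valid indices.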


To use the same format as before, we may also want the ability to "flatten" our 4D state space into a single 1D list of states that we can pair up with the two actions.  There are a handful of ways to do this, but `np.reshape` is one way to take a higher-dimensional array and rearrange its indices into a different shape.

In [14]:
#reshaping multivariate data into a single list
a = np.arange(6).reshape((3, 2))
print(a)


print(np.reshape(a, (2, 3)))

#to turn a larger array into a single column, we make one of the reshape dimensions 1
print(np.reshape(a, (6, 1)))

#...which now has each element in its own row of a column vector.


#In higher dimensions:
b = np.arange(24).reshape((2, 4,3))
print(b)
print('reshapes to ')
print(np.reshape(b, (24,1)))

[[0 1]
 [2 3]
 [4 5]]
[[0 1 2]
 [3 4 5]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]]
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]

 [[12 13 14]
  [15 16 17]
  [18 19 20]
  [21 22 23]]]
reshapes to 
[[ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 [15]
 [16]
 [17]
 [18]
 [19]
 [20]
 [21]
 [22]
 [23]]


If you'll often want to move from one index to the other, you may also create a couple of dictionaries of the form `1D-to-ND` and vice versa, where you can input the integer key of the `1D` array to get the coordinate tuple OR input the tuple and get the integer value.
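A minimal sketch of those two dictionaries, assuming the same bin counts `(5, 5, 10, 10)` used in the discretization example. The dictionary names are made up for illustration. NumPy's `np.ravel_multi_index` and `np.unravel_index` compute the same mapping without storing anything, so they are used here as a cross-check.

```python
import numpy as np

shape = (5, 5, 10, 10)  # bins per state dimension

# np.ndindex walks every coordinate tuple in row-major (C) order.
nd_to_1d = {coords: i for i, coords in enumerate(np.ndindex(*shape))}
one_d_to_nd = {i: coords for coords, i in nd_to_1d.items()}

coords = (2, 2, 1, 4)   # the discretized state printed earlier
flat = nd_to_1d[coords]
print(flat, one_d_to_nd[flat])

# Cross-check against NumPy's built-in conversions.
assert flat == np.ravel_multi_index(coords, shape)
assert one_d_to_nd[flat] == tuple(int(c) for c in np.unravel_index(flat, shape))

# A flat Q-table then has one row per discretized state and one column per action.
q_flat = np.zeros((len(nd_to_1d), 2))
print(q_flat.shape)
```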