## Q-Learning on OpenAI Gym

[OpenAI Gym](https://github.com/openai/gym) is a toolkit for developing and comparing reinforcement learning algorithms. Gym makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

There are two basic concepts in reinforcement learning: the environment (namely, the outside world) and the agent (namely, the algorithm you are writing). The agent sends actions to the environment, and the environment replies with observations and rewards (that is, a score).
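The agent-environment loop described above can be sketched without Gym at all. The example below is a minimal illustration, not Gym code: `CorridorEnv` and `run_episode` are hypothetical names, and the environment only mimics the `reset()` / `step(action)` interface (returning state, reward, done, info) that the rest of this notebook relies on.

```python
import random


class CorridorEnv:
    """Hypothetical 5-cell corridor; reaching the last cell ends the episode."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode in the left-most cell and return the first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left (clamped to the corridor).
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=1000):
    """Let the agent (policy) act until the episode ends or max_steps is reached."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        action = policy(state)                         # agent sends an action
        state, reward, done, info = env.step(action)   # environment replies
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


random.seed(0)
print(run_episode(CorridorEnv(), lambda state: random.choice([0, 1])))
```

The policy here is just a function from state to action, so a random agent and a learned Q-table agent can be swapped in without touching the loop.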

We will create some OpenAI Gym environments and use the Gym API to solve them, both through Q-learning and through random exploration.

### Install AI Gym Environment

In [1]:
#!conda create -n gym python=3 pip
#!activate gym
#!pip install gym
#!conda install pystan
#!pip install gym[all]

In [10]:
import gym
from gym import envs
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [11]:
#print all AI Gym environments
envs.registry.all()

dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0), EnvSpec(KellyCoinflip-v0), EnvSpec(KellyCoinflipGeneralized-v0), EnvSpec(FrozenLake-v0), EnvSpec(FrozenLake8x8-v0), EnvSpec(CliffWalking-v0), EnvSpec(NChain-v0), EnvSpec(Roulette-v0), EnvSpec(Taxi-v2), EnvSpec(GuessingGame-v0), EnvSpec(HotterColder-v0), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(Hopper-v2), EnvSpec(Swimmer-v2), EnvSpec(Walker2d-v2), EnvSpec(Ant-v2), EnvSpec(Hum

In [12]:
#function that renders the environment and outputs space variables
def showEnv(env):  
    env.reset()
    env.render()
    print("\n------ Observation Space:", env.observation_space.n, ", Action Space:", env.action_space.n)

In [13]:
#function that invokes 2 steps in the environment
def take2steps(env):  
    #Every Gym environment will return these same four variables after an action is taken
    #they are the core variables of a reinforcement learning problem.
    print("\n------ Taking 2 Steps in environment:")
    state, reward, done, info = env.step(1)
    print('State t1: State: {}, Reward: {}, Done: {}, Info{}'.format(state, reward, done, info))
    state, reward, done, info = env.step(1)
    print('State t2: State: {}, Reward: {}, Done: {}, Info{}'.format(state, reward, done, info))

### Solve Environment Randomly

Create a loop that will take random actions until the environment is solved.

In [14]:
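def solveEnvRandom(env, max_reward):
    counter = 0
    reward = 0
    #start from a fresh episode, since the environment may already be finished
    env.reset()
    while reward < max_reward:
        state, reward, done, info = env.step(env.action_space.sample())
        counter += 1
        #if the episode ended without reaching max_reward, start a new one
        if done and reward < max_reward:
            env.reset()
    print("\n------ Environment solved in", counter, "random steps")
    env.render()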

### Solve through Q-Learning

First (#1): The agent starts by choosing the action with the highest Q value for the current state using argmax, which returns the index (that is, the action) with the highest value for that state. In the code below, random noise that shrinks with each episode is added to the Q values before the argmax, so the agent explores more in early episodes. Initially, the Q table is all zeros; after every step, the Q value of the visited state-action pair is updated.

Second (#2): The agent then takes the action, and we store the resulting state as state2 (St+1). This lets the update use the value of the new state.

Third (#3): We update the state-action pair (St, At) in Q using the reward and the max Q value for state2 (St+1). This update uses the action-value formula (based on the Bellman equation) and allows state-action pairs to be updated recursively from estimates of future values. The update rule is shown in the figure below.

<img src="https://upload.wikimedia.org/wikipedia/commons/f/f0/Q-l%C3%A6ring_formel_1.png" />
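To make the update rule concrete, here is a minimal sketch of a single update step using the `alpha` and `gamma` values from this notebook. `q_update` is a hypothetical helper written for illustration; it applies the same expression as step #3 in `solveEnvQ`. With an all-zero table and a reward of -10 (the Taxi penalty for an illegal pick-up or drop-off), the result is 0.618 × (-10) ≈ -6.18, which is consistent with the -6.18 entries in the Taxi Q-table output further down.

```python
import numpy as np

alpha = 0.618  # learning rate used in this notebook
gamma = 0.9    # discount factor used in this notebook


def q_update(Q, state, action, reward, state2):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Target: immediate reward plus discounted best value of the next state.
    td_target = reward + gamma * np.max(Q[state2])
    # Move the current estimate a fraction alpha towards the target.
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q[state, action]


Q = np.zeros((3, 2))  # toy table: 3 states, 2 actions
print(q_update(Q, state=0, action=1, reward=-10, state2=0))
```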

In [23]:
def solveEnvQ(env): 
    print("\n------ Solving Environment through Q Learning ")
    Q = np.zeros([env.observation_space.n, env.action_space.n])

    alpha = 0.618
    gamma = 0.9
    epsilon = 0.1 #not used below; exploration comes from the decaying noise added in step 1
    episodes = 5000
    rewardAll = []
    
    #iterate over the total number of episodes
    for episode in range(episodes):
        done = False
        totalReward, reward = 0,0
        state = env.reset()
        while not done:
                #1.Choose action from Q table
                #action = np.argmax(Q[state])
                
                #1.Choose an action by greedily (with noise) picking from Q table
                action = np.argmax(Q[state,:] + np.random.randn(1,env.action_space.n)*(1./(episode+1)))
                
                #2. Get new state & reward from environment
                state2, reward, done, info = env.step(action) #2
                
                #3. Update Q-Table with new knowledge
                Q[state,action] += alpha * (reward + gamma*np.max(Q[state2]) - Q[state,action])
                                
                #Total Reward in current episode
                totalReward += reward
                
                #Update State
                state = state2   
                
        rewardAll.append(totalReward)
        if episode % 500 == 0:
            print('Episode {}, Total Reward: {}'.format(episode,totalReward))
    
    #render final environment
    print ("Final Environment: ")
    env.render()
    
    #print the average reward per episode (labelled "Reward Sum" in the output) and the final Q table
    print ("Reward Sum on all episodes " + str(sum(rewardAll)/episodes))
    print ("Final Values Q-Table: ")
    print (Q)
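
`solveEnvQ` defines `epsilon = 0.1` but never uses it; exploration there comes from Gaussian noise that decays with the episode number. For comparison, the following is a minimal sketch of epsilon-greedy action selection, a common alternative. It is not what the function above does, and `epsilon_greedy` is a hypothetical helper name.

```python
import numpy as np


def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(np.argmax(Q[state]))          # exploit: best known action


rng = np.random.default_rng(0)
Q = np.array([[0.0, 1.0, 0.5],
              [2.0, 0.0, 0.0]])
print(epsilon_greedy(Q, state=0, epsilon=0.1, rng=rng))
```

Unlike the decaying-noise approach, a fixed epsilon keeps exploring at the same rate for the whole run unless it is reduced explicitly.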

### Taxi 

“There are 4 locations (labelled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.” (Source: https://gym.openai.com/envs/Taxi-v2/ )
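
The 500 states reported for this environment can be read as 5 rows × 5 columns × 5 passenger locations (R, G, Y, B, or in the taxi) × 4 destinations. The sketch below decodes a state number under the assumption that it is packed as `((row * 5 + col) * 5 + passenger) * 4 + destination`; `decode_taxi_state` and `LOCATIONS` are hypothetical names for illustration. Under that assumption, the two states printed in the output below (286 and 186) decode to the taxi moving up one row at a time in the right-most column, with the passenger at G and the destination at Y.

```python
LOCATIONS = ["R", "G", "Y", "B", "in taxi"]


def decode_taxi_state(state):
    """Invert the assumed packing ((row * 5 + col) * 5 + passenger) * 4 + destination."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    row, col = divmod(state, 5)
    return row, col, passenger, destination


for s in (286, 186):
    row, col, passenger, destination = decode_taxi_state(s)
    print(s, "-> taxi at", (row, col),
          "| passenger:", LOCATIONS[passenger],
          "| destination:", LOCATIONS[destination])
```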

In [18]:
env = gym.make("Taxi-v2")

#display environment
showEnv(env)

#take 2 steps in environment
take2steps(env)

#solve environment by Q learning
solveEnvQ(env)

#solve environment randomly
solveEnvRandom(env, 20)

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+


------ Observation Space: 500 , Action Space: 6

------ Taking 2 Steps in environment:
State t1: State: 286, Reward: -1, Done: False, Info{'prob': 1.0}
State t2: State: 186, Reward: -1, Done: False, Info{'prob': 1.0}

------ Solving Environment through Q Learning 
Episode 0, Total Reward: -560
Episode 500, Total Reward: 12
Final Environment: 
+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Reward Sum on all episodes -17.624
Final Values Q-Table: 
[[ 0.          0.          0.          0.          0.          0.        ]
 [-4.15437569 -3.99706793 -3.8282828  -3.99706793  1.62261467 -6.18      ]
 [-3.45274539 -3.60164989 -3.47657088 -3.60164989  7.7147     -6.18      ]
 ...
 [-2.73095859 -2.50650277 -2.73095859 -2.53123591 -6.18       -6.18      ]
 [-3.99706793 -3.76617866 -3.99706793 -3.85997197 -6

### Frozen Lake

The [**FrozenLake environment**](https://gym.openai.com/envs/FrozenLake-v0/) consists of a 4x4 grid of blocks, each one being either the start block, the goal block, a safe frozen block, or a dangerous hole. The objective is to have an agent learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move up, down, left, or right. The catch is that there is a wind which occasionally blows the agent onto a space they didn’t choose. As such, perfect performance every time is impossible, but learning to avoid the holes and reach the goal is certainly still doable. The reward at every step is 0, except for entering the goal, which provides a reward of 1.


<br>SFFF       (S: starting point, safe)
<br>FHFH       (F: frozen surface, safe)
<br>FFFH       (H: hole, fall to your doom)
<br>HFFG       (G: goal, where the frisbee is located)
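
The "wind" shows up in the output below as `'prob': 0.3333333333333333`. The sketch that follows is a small model of that behaviour, not Gym's own code. It assumes states are numbered row by row (`state = row * 4 + col`), actions are numbered left = 0, down = 1, right = 2, up = 3, and the agent ends up moving in the chosen direction or one of the two perpendicular directions with probability 1/3 each. `move` and `slippery_outcomes` are hypothetical helper names.

```python
MAP = ["SFFF",
       "FHFH",
       "FFFH",
       "HFFG"]
SIZE = 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action numbering


def move(row, col, action):
    """Move one cell in the given direction, staying inside the grid."""
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, SIZE - 1)
    elif action == RIGHT:
        col = min(col + 1, SIZE - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row, col


def slippery_outcomes(state, action):
    """List (probability, next_state, tile) for the chosen action and its two neighbours."""
    row, col = divmod(state, SIZE)
    outcomes = []
    for actual in [(action - 1) % 4, action, (action + 1) % 4]:
        r, c = move(row, col, actual)
        outcomes.append((1.0 / 3.0, r * SIZE + c, MAP[r][c]))
    return outcomes


for prob, next_state, tile in slippery_outcomes(0, DOWN):
    print(round(prob, 3), next_state, tile)
```

Under these assumptions, choosing "down" from the start state can leave the agent in state 0, move it to state 4, or move it to state 1, which matches the two steps in the output below (state 0, then state 4).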

In [24]:
env = gym.make('FrozenLake-v0')

#display environment
showEnv(env)

#take 2 steps in environment
take2steps(env)

solveEnvQ(env)


[41mS[0mFFF
FHFH
FFFH
HFFG

------ Observation Space: 16 , Action Space: 4

------ Taking 2 Steps in environment:
State t1: State: 0, Reward: 0.0, Done: False, Info{'prob': 0.3333333333333333}
State t2: State: 4, Reward: 0.0, Done: False, Info{'prob': 0.3333333333333333}

------ Solving Environment through Q Learning 
Episode 0, Total Reward: 0.0
Episode 500, Total Reward: 0.0
Episode 1000, Total Reward: 1.0
Episode 1500, Total Reward: 1.0
Episode 2000, Total Reward: 1.0
Episode 2500, Total Reward: 0.0
Episode 3000, Total Reward: 0.0
Episode 3500, Total Reward: 1.0
Episode 4000, Total Reward: 1.0
Episode 4500, Total Reward: 1.0
Final Environment: 
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
Reward Sum on all episodes 0.3984
Final Values Q-Table: 
[[5.45148732e-02 2.24289198e-03 1.54820674e-03 2.57370119e-03]
 [2.55659244e-04 1.11674919e-03 8.39357980e-04 6.47798982e-02]
 [7.22054679e-04 5.81631822e-02 7.45345495e-04 1.01081946e-03]
 [9.71566805e-04 5.53557418e-04 2.72134453e-05 3.5103110