# The OpenAI Gym 
#### (Author : Soufiane Fadel)
\\


In order to download and install OpenAI Gym, you can use any of the following options:

In [0]:
!pip3 install gym[all]

To understand the basics of importing Gym packages, loading an environment, and other important functions associated with OpenAI Gym, here's an example of a Frozen Lake environment.

In [0]:
import gym
env = gym.make('FrozenLake-v0') 

## FrozenLake Game 

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.


*Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.*

The surface is described using a grid like the following:

* SFFF       (S: starting point, safe)
* FHFH       (F: frozen surface, safe)
* FFFH       (H: hole, fall to your doom)
* HFFG       (G: goal, where the frisbee is located) 

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.




Next, we come to resetting the environment. While performing a reinforcement learning task, an agent undergoes learning through multiple episodes. As a result, at the start of each episode, the environment needs to be reset so that it comes to its initial situation and the agent begins from the start state. The following code shows the process for resetting an environment:

After taking each action, there might be a requirement to show the status of the agent in the environment. Visualizing that status is done by:


In [0]:
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


In [0]:
import gym 
env = gym.make("FrozenLake-v0")
s = env.reset()  # resets the environement and returns the start state as a value 
print(s)  # the initial state is 0 

0


In newer versions of the Gym, the environment features can't be modified directly. This is done by unwrapping the environment parameters with:

In [0]:
env = env.unwrapped 

Each environment is defined by the **state spaces** and **action spaces** for the agent to perform. The type (discrete or continuous) and size of state spaces and action spaces is very important to know in order to build a reinforcement learning agent:

In [0]:
print(env.action_space)
print(env.action_space.n)

Discrete(4)
4


The **Discrete(4)** output means that the action space of the Frozen Lake environment is a discrete set of values and has four distinct actions that can be performed by the agent.


In [0]:
print(env.observation_space)
print(env.observation_space.n)

Discrete(16)
16


The **Discrete(16)** output means that the observation (state) space of the Frozen
Lake environment is a discrete set of values and has 16 different states to be explored by the agent.

## Programming an agent for Frozen Lake Game with Q-Learning Epsilon-Greedy approach

In [0]:
from __future__ import print_function
import gym
import numpy as np
import time


#### Load the environment

In [0]:
env = gym.make('FrozenLake-v0')
s = env.reset()
print(s)
env.render() # show the status of the agent in the environment

0

[41mS[0mFFF
FHFH
FFFH
HFFG


#### The epsilon greedy function  

In [0]:
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    #print(p)
    if p > epsilon:
        return np.argmax(Q[s,:]) # say here,initial policy = for each state consider the action having highest Q-value
    else:
        return env.action_space.sample()

#### Q-Learning Implementation

In [0]:
#Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n]) 

# set hyperparameters
lr = 0.5  # learning rate
y = 0.9  # discount factor lambda
eps = 600000  # total episodes being 100000

###### Training 

In [0]:
t1 = time.time()

for i in range(eps):
    s = env.reset()
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)
        s_,r,t,_ = env.step(a)
        if (r==0):  
            if t==True:
                r = -5 # to give negative rewards when holes turn up
                Q[s_] = np.ones(env.action_space.n)*r    #in terminal state Q value equals the reward
            else:
                r = -1  # to give negative rewards to avoid long routes
        if (r==1):
                r = 100
                Q[s_] = np.ones(env.action_space.n)*r    #in terminal state Q value equals the reward
        Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,a]) - Q[s,a])
        s = s_   
        if (t == True) :
                break

t2 = time.time()
print("the training time for q-learning with epsillon greedy approach is : " + str((t2-t1)/60) + " min" )

the training time for q-learning with epsillon greedy approach is : 3.2827404936154685 min


#### Print the Q-Table 

In [0]:
print("Q-table")
print(Q)

Q-table
[[ -9.87042831  -9.237709    -9.62332946 -10.        ]
 [ -9.65775364  -8.80562147  -9.5638128  -10.        ]
 [ -9.8090408   -7.78664567  -9.69939168 -10.        ]
 [ -9.63623497  -9.15642147  -9.62434233 -10.        ]
 [ -9.83454456  -9.34574071  -9.58726682  -9.74353066]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.78640909  -7.14843793  -9.55192421  -9.61975158]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.84750036  -7.707354    -9.53242899  -9.72817188]
 [ -9.73596282  -4.76495532  -9.21988037  -9.69629991]
 [ -9.8111297   -7.68835189  -9.59619696  -9.55662704]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.5982959   -3.66647969  18.3287359   -4.90939623]
 [ -9.7953489    4.08707219  47.86463047  14.4963989 ]
 [100.         100.         100.         100.        ]]


#### Testing the trained agent 

In [0]:
s = env.reset()
env.render()
while(True):
    a = np.argmax(Q[s])
    s_,r,t,_ = env.step(a)
    env.render()
    s = s_
    if(t==True) :
        break        


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Down)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m


## Programming an agent for Frozen Lake Game with Q-Network approach

In [0]:
import gym
import numpy as np
import tensorflow as tf
import random
from matplotlib import pyplot as plt
import time 

#Load the Environment
env = gym.make('FrozenLake-v0')


#### Creating Neural Network and the defining the Loss

In [0]:
tf.reset_default_graph()

#tensors for inputs, weights, biases, Qtarget
inputs = tf.placeholder(shape=[None,env.observation_space.n],dtype=tf.float32)
W = tf.get_variable(name="W",dtype=tf.float32,shape=[env.observation_space.n,env.action_space.n],initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),dtype=tf.float32)
qpred = tf.add(tf.matmul(inputs,W),b)
apred = tf.argmax(qpred,1)

qtar = tf.placeholder(shape=[1,env.action_space.n],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(qtar-qpred))

train = tf.train.AdamOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)


The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



#### Training the neural network

###### Initialization

In [0]:
init = tf.global_variables_initializer()

#learning parameters
y = 0.5
e = 0.3
episodes = 100000

#list to capture total steps and rewards per episodes
slist = []
rlist = []

###### Training Loop

In [0]:
t1 = time.time()

with tf.Session() as sess:
    sess.run(init)
    for i in range(episodes):
        s = env.reset() # resetting the environment at the start of each episode
        r_total = 0  # to calculate the sum of rewards in the current episode
        while(True):
            #running the Q-network created above
            a_pred,q_pred = sess.run([apred,qpred],feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
            #a_pred is the action prediction by the neural network
            #q_pred contains q_values of the actions at current state 's'
            if np.random.uniform(low=0,high=1) < e:
                a_pred[0] = env.action_space.sample()
                #exploring different action by randomly assigning them as the next action
            s_,r,t,_ = env.step(a_pred[0])  #action taken and new state 's_' is encountered with a feedback reward 'r'
            if r==0: 
                if t==True:
                    r=-5  #if hole make the reward more negative
                else:
                    r=-1  #if block is fine/frozen then give slight negative reward to optimise the path
            if r==1:
                    r=5       #good positive goat state reward

            q_pred_new = sess.run(qpred,feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
            #q_pred_new contains q_values of the actions at the new state 

            #update the Q-target value for action taken
            targetQ = q_pred
            max_qpredn = np.max(q_pred_new)
            targetQ[0,a_pred[0]] = r + y*max_qpredn
            #this gives our targetQ

            #train the neural network to minimise the loss
            _ = sess.run(minimizer,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],qtar:targetQ})
            r_total+=r

            s=s_
            if t==True:
                break
    
# learning ends with the end of the loop of several episodes above
    
t2 = time.time()
print("the training time for q-learning with epsillon greedy approach is : " + str((t2-t1)/60) + " min" )

the training time for q-learning with epsillon greedy approach is : 70.54682284990946 min


###### Check how much our Q-network agent has learned 

In [0]:
with tf.Session() as sess:
  sess.run(init)
  #learning ends with the end of the loop of several episodes above
  #let's check how much our agent has learned
  s = env.reset()
  env.render()
  while(True):
      a = sess.run(apred,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
      s_,r,t,_ = env.step(a[0])
      env.render()
      s = s_
      if t==True:
          break



[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
