## Step1. Install OpenAI Gym; the instructions can be found at https://gym.openai.com/docs/

In [1]:
import numpy as np
import gym
import random
import pandas as pd
import time

## Step2. Import the FrozenLake-v0 environment

In [2]:
env = gym.make("FrozenLake-v0")

The Q-table needs one row per state and one column per action, so we first read the number of actions (n_size) and the number of states (m_size) from the environment.

In [3]:
n_size = env.action_space.n
m_size = env.observation_space.n

In [4]:
qtable = np.zeros((m_size, n_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### Specifying Hyperparameters

In [5]:
total_episodes = 15000
learning_rate = 0.1
max_steps = 120
gamma = 0.95
epsilon = 0.9
max_epsilon = 0.9
min_epsilon = 0.01 
decay_rate = 0.001

## Step3. Train a model using Q-learning and generate a Q-table, save this table as a separate file

In [6]:
finalRewards = []

start_time=time.time()

for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Draw a random number in [0, 1]
        exp_tradeoff = random.uniform(0, 1)
        
        # If this number is greater than epsilon --> exploitation (take the action with the largest Q-value for this state)
        if exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        else:
            action = env.action_space.sample()
    
        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]     
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # Move to the new state
        state = new_state
        
        # Stop when the episode has ended (the agent fell into a hole or reached the goal)
        if done:
            break
        
    # Decay epsilon so the agent explores less as training progresses
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    finalRewards.append(total_rewards)
    
end_time=time.time()

print ("Success rate: " +  str(sum(finalRewards)/total_episodes))
print("Time taken to train the model:", round((end_time-start_time)/60,2),"mins")

Success rate: 0.5396
Time taken to train the model: 0.43 mins
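
The update rule inside the training loop can be checked by hand on a single transition. The sketch below is only an illustration: the helper name `q_update` and the example transition are assumptions, not part of the notebook. It uses the same learning_rate and gamma as above.

```python
import numpy as np

def q_update(qtable, state, action, reward, new_state, learning_rate=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the updated value."""
    # Target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * np.max(qtable[new_state, :])
    # Move Q(s, a) a fraction learning_rate of the way toward the target
    qtable[state, action] += learning_rate * (td_target - qtable[state, action])
    return qtable[state, action]

# Hypothetical transition: state 14, action 1, lands on the goal (state 15) with reward 1
q = np.zeros((16, 4))
first = q_update(q, state=14, action=1, reward=1.0, new_state=15)
second = q_update(q, state=14, action=1, reward=1.0, new_state=15)
print(first, second)
```

Starting from zeros, the first update gives 0 + 0.1 * (1 + 0.95 * 0 - 0) = 0.1, and repeating the same transition moves the value further toward 1 by another tenth of the remaining gap.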


In [7]:
df=pd.DataFrame(qtable)
df.to_csv('qtable.csv')
print(df)

           0         1         2         3
0   0.185419  0.153966  0.157922  0.154258
1   0.094987  0.063773  0.084645  0.150995
2   0.156970  0.100657  0.090816  0.092033
3   0.025658  0.060784  0.009628  0.009657
4   0.214032  0.158227  0.146966  0.151399
5   0.000000  0.000000  0.000000  0.000000
6   0.077205  0.078083  0.199557  0.060935
7   0.000000  0.000000  0.000000  0.000000
8   0.141160  0.171727  0.179706  0.289443
9   0.260007  0.394554  0.266592  0.238510
10  0.367916  0.259917  0.147884  0.203182
11  0.000000  0.000000  0.000000  0.000000
12  0.000000  0.000000  0.000000  0.000000
13  0.282553  0.379642  0.460882  0.378357
14  0.488717  0.683225  0.574123  0.517127
15  0.000000  0.000000  0.000000  0.000000
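
One way to read the Q-table is to take the best action in each state and lay the result out on the 4x4 grid. The sketch below copies the values printed above and assumes the usual FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up); the arrow symbols are an illustrative choice.

```python
import numpy as np

# Q-values copied from the printed table (rows = states, columns = actions)
qtable = np.array([
    [0.185419, 0.153966, 0.157922, 0.154258],
    [0.094987, 0.063773, 0.084645, 0.150995],
    [0.156970, 0.100657, 0.090816, 0.092033],
    [0.025658, 0.060784, 0.009628, 0.009657],
    [0.214032, 0.158227, 0.146966, 0.151399],
    [0.0, 0.0, 0.0, 0.0],
    [0.077205, 0.078083, 0.199557, 0.060935],
    [0.0, 0.0, 0.0, 0.0],
    [0.141160, 0.171727, 0.179706, 0.289443],
    [0.260007, 0.394554, 0.266592, 0.238510],
    [0.367916, 0.259917, 0.147884, 0.203182],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.282553, 0.379642, 0.460882, 0.378357],
    [0.488717, 0.683225, 0.574123, 0.517127],
    [0.0, 0.0, 0.0, 0.0],
])

# Assumed action encoding: 0 = left, 1 = down, 2 = right, 3 = up
ARROWS = np.array(["<", "v", ">", "^"])

greedy = np.argmax(qtable, axis=1)   # best action per state
visited = qtable.any(axis=1)         # all-zero rows were never updated
symbols = np.where(visited, ARROWS[greedy], ".")
print(symbols.reshape(4, 4))
```

The all-zero rows (states 5, 7, 11, 12 and 15) are the terminal states, the holes and the goal. No action is ever taken from them, so their Q-values are never updated.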


## Step4. Discuss the parameters used to produce your training results

In [8]:
#total_episodes = 15000
#learning_rate = 0.1
#max_steps = 120
#gamma = 0.95
#epsilon = 0.9
#max_epsilon = 0.9
#min_epsilon = 0.01 
#decay_rate = 0.001

1) The agent tries out various state-action combinations until it either reaches the goal, falls into a hole, or runs out of steps. Each such attempt is called an episode, and a new episode begins each time the previous one ends. total_episodes = 15000 is the number of episodes used for training.<br>
2) The learning rate controls how far each update moves Q(s,a) toward the new estimate, reward + gamma * max Q(s',a'). A very large learning rate lets every single noisy transition overwrite what was learned before, which can keep the Q-values from converging; a small value such as 0.1 averages over many transitions.<br>
3) max_steps is the upper limit on the number of steps in one episode. It should be kept fairly high so that the agent has enough steps to reach the goal instead of being cut off mid-episode.<br>
4) We want the agent to take random actions when it first starts learning so that it can explore more paths. As the agent improves, the Q-function converges to increasingly consistent Q-values, and we then want the agent to act greedily and exploit the paths with the highest Q-value. This is where epsilon enters the picture:
with probability epsilon the agent takes a random action, while with probability (1-epsilon) it takes the greedy action.<br>
5) A decaying epsilon-greedy action selection was employed by Google DeepMind, where epsilon decays from 1 to 0.1 over time: at first the system makes fully random moves to maximize its exploration of the state space, and eventually it settles down to a fixed exploration rate. That is why I have used max_epsilon=0.9 and min_epsilon=0.01 with a decay rate of 0.001.
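
To make the exploration schedule concrete, the following sketch evaluates the decay formula from the training loop at a few episode numbers and shows epsilon-greedy selection as a standalone function. The helper names `epsilon_at` and `choose_action` are assumptions introduced for illustration; the defaults are the hyperparameters listed above.

```python
import math
import random

def epsilon_at(episode, max_epsilon=0.9, min_epsilon=0.01, decay_rate=0.001):
    """Exponential decay from max_epsilon toward min_epsilon, as in the training loop."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

def choose_action(q_row, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.uniform(0, 1) > epsilon:
        # Exploit: index of the largest Q-value in this state's row
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: any action, chosen uniformly
    return rng.randrange(len(q_row))

for episode in (0, 1000, 5000, 15000):
    print(episode, round(epsilon_at(episode), 4))
```

With these settings epsilon falls from 0.9 to roughly 0.34 after 1,000 episodes and is within about 0.006 of its floor of 0.01 by episode 5,000, so most of the 15,000 training episodes are almost entirely greedy.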

## Step5. Discuss the results in terms of success rate, the time it took to train

In [9]:
print ("Success Rate: " +  str(sum(finalRewards)/total_episodes))
print("Time taken to train the model:", round((end_time-start_time)/60,2),"min")

Success Rate: 0.5396
Time taken to train the model: 0.43 min


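We can see that out of 15000 episodes, the agent reached the goal in 53.96% of them, which is our success rate. Because this figure is averaged over all training episodes, it also includes the early, mostly random ones. Training took 0.43 minutes (about 26 seconds).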
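
The success rate above is measured while the agent is still exploring and learning. A complementary check, sketched below, runs the greedy policy only, with no random actions and no updates. The function name `evaluate_greedy` and the tiny `CorridorEnv` stand-in are hypothetical; the stand-in exists only so the sketch runs without Gym. It assumes the same environment interface as the training loop, where `reset()` returns the state and `step()` returns a 4-tuple.

```python
import numpy as np

def evaluate_greedy(env, qtable, episodes=1000, max_steps=120):
    """Fraction of episodes that end with a positive reward under the greedy policy."""
    successes = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = int(np.argmax(qtable[state, :]))  # no exploration
            state, reward, done, _ = env.step(action)
            if done:
                if reward > 0:
                    successes += 1
                break
    return successes / episodes

class CorridorEnv:
    """Hypothetical 3-state corridor: action 1 moves right, state 2 is the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, float(done), done, {}

demo_q = np.zeros((3, 2))
demo_q[:, 1] = 1.0  # always prefer "right"
rate = evaluate_greedy(CorridorEnv(), demo_q, episodes=10)
print(rate)
```

In the notebook itself the call would be `evaluate_greedy(env, qtable)`. Since FrozenLake is stochastic, the result will vary from run to run.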