<a href="https://colab.research.google.com/github/rahiakela/deep-reinforcement-learning-with-python/blob/main/02-guide-to-gym-toolkit/1_creating_gym_environment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating gym environment

We learned that Gym provides a variety of environments for training a reinforcement learning agent. To understand how a Gym environment is designed, we will start with a basic environment and then move on to the more complex ones.

Let's introduce one of the simplest environments, the frozen lake environment, which is shown below. In this environment, the goal of the agent is to start from the initial state S and reach the goal state G.

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env1.png?raw=1'/>

In the above environment, the following applies:

* S denotes the starting state
* F denotes the frozen state
* H denotes the hole state
* G denotes the goal state

So, the agent has to start from the state S and reach the goal state G. The catch is that if the agent visits a hole state H, it falls into the hole and dies, as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env2.png?raw=1'/>

So, we need to make sure that the agent starts from S and reaches G without falling into the hole state H as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env3.png?raw=1'/>

Each grid box in the above environment is called a state, so we have 16 states (S to G), and we have 4 possible actions: up, down, left, and right. We learned that our goal is to reach the state G from S without visiting H. So, we assign a reward of 0 to every state except the goal state G, which gets a reward of +1.
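
The layout and reward rule described above can be sketched in a few lines of plain Python. This is only an illustration: the names `LAKE_MAP` and `reward_for` are hypothetical and not part of Gym, and the grid is assumed to be the 4x4 map that `env.render()` prints later in this section.

```python
# Illustrative layout of the 4x4 frozen lake (an assumption, not a Gym API).
LAKE_MAP = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

# Flatten the grid row by row so each tile corresponds to one state.
tiles = "".join(LAKE_MAP)
num_states = len(tiles)
num_holes = tiles.count("H")


def reward_for(tile):
    """Return +1 for the goal tile G and 0 for every other tile."""
    return 1.0 if tile == "G" else 0.0


print(num_states)  # 16
print(num_holes)   # 4
print([reward_for(tile) for tile in tiles].count(1.0))  # 1
```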

Thus, we learned how the frozen lake environment works. To train our agent in it, we would first need to create the environment by coding it from scratch in Python. Luckily, we don't have to do that. Since Gym provides a variety of environments, we can import the Gym toolkit and create a frozen lake environment directly.


Now, we will learn how to create our frozen lake environment using Gym. Before running any code, make sure that you have activated our virtual environment, `universe`. First, let's import the gym library:


In [1]:
import gym

Next, we can create a gym environment using the make function.  The make function requires the environment id as a parameter. 

In the gym, the id of the frozen lake environment is `FrozenLake-v0`. So, we can create our frozen lake environment as shown below:

In [2]:
env = gym.make("FrozenLake-v0")

After creating the environment, we can see what our environment looks like using the render function:

In [3]:
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


As we can observe, the frozen lake environment consists of 16 states (S to G), as we learned. The state S is highlighted (in the plain-text output above, the highlight shows up as the color codes `[41m` and `[0m` around S), indicating that it is our current state, that is, the agent is in the state S. So whenever we create an environment, the agent always begins from the initial state, which in our case is the state S.

## Exploring the environment

We learned that the reinforcement learning environment can be modeled as the Markov decision process (MDP) and an MDP consists of the following: 

* __States__ -  A set of states present in the environment 
* __Actions__ - A set of actions that the agent can perform in each state. 
* __Transition probability__ - The transition probability is denoted by $P(s'|s,a)$. It is the probability of moving from a state $s$ to the state $s'$ while performing an action $a$.
* __Reward function__ - The reward function is denoted by $R(s,a,s')$. It is the reward the agent obtains when moving from a state $s$ to the state $s'$ while performing an action $a$.

Let's now understand how to obtain all the above information from the frozen lake environment we just created using the gym.

## States

The state space consists of all of our states. We can obtain the state space, and with it the number of states in our environment, by typing `env.observation_space` as shown below:

In [4]:
print(env.observation_space)

Discrete(16)


It implies that we have 16 discrete states in our state space, from the state S to G. Note that Gym encodes each state as a number, so the state S is encoded as 0, the state F next to it is encoded as 1, and so on, as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env4.png?raw=1'/>
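
The numbering can be reproduced with a little arithmetic. The sketch below assumes the states are numbered row by row across the 4x4 grid, which is consistent with the examples in this section (for instance, moving down from state 2 leads to state 6). The helper names `to_state` and `to_position` are hypothetical and not part of Gym.

```python
NUM_COLS = 4  # the frozen lake grid is 4 columns wide


def to_state(row, col):
    """Convert a (row, col) grid position into its state number."""
    return row * NUM_COLS + col


def to_position(state):
    """Convert a state number back into its (row, col) grid position."""
    return divmod(state, NUM_COLS)


print(to_state(0, 0))  # 0, the start state S
print(to_state(3, 3))  # 15, the goal state G
print(to_position(6))  # (1, 2)
```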

## Actions

We learned that the action space consists of all the possible actions in the environment. We can obtain the action space by `env.action_space` as shown below:

In [5]:
print(env.action_space)

Discrete(4)


It implies that we have 4 discrete actions in our action space: left, down, right, and up. Note that, similar to states, actions are also encoded as numbers, as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env5.png?raw=1'/>
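
To avoid sprinkling bare numbers through our code, we can keep the encoding in a small lookup table. This is a convenience sketch: the dictionaries `ACTION_NAMES` and `ACTION_IDS` are hypothetical names of our own, not something Gym provides.

```python
# Action numbers of the frozen lake environment, in the order listed above.
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

# Reverse lookup, so we can go from a readable name to the action number.
ACTION_IDS = {name: number for number, name in ACTION_NAMES.items()}

print(ACTION_NAMES[1])      # down
print(ACTION_IDS["right"])  # 2
```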

## Transition probability and Reward function

Now, let's look at how to obtain the transition probability and the reward function. We learned that in a stochastic environment, performing an action $a$ in a state $s$ does not always lead to the same next state $s'$. Because of the randomness in the environment, the agent reaches each possible next state with some probability.

Let's suppose we are in state 2 (F). Now if we perform action 1 (down) in state 2, we can reach the state 6 as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env6.png?raw=1'/>

Our frozen lake environment is a stochastic environment. In a stochastic environment, we won't always reach state 6 by performing action 1 (down) in state 2; we also reach other states with some probability. When we perform action 1 (down) in state 2, we reach state 1 with probability 0.33333, state 6 with probability 0.33333, and state 3 with probability 0.33333, as shown below:


<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env7.png?raw=1'/>
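
The probabilities above follow a simple pattern, which we can sketch in plain Python. The sketch assumes the slip rule suggested by the numbers in this section: the agent moves in the chosen direction or in one of the two perpendicular directions, each with probability 1/3, and stays in place if the move would leave the grid. The function names are hypothetical, and the sketch ignores what happens once the agent is already in a terminal state.

```python
NUM_ROWS, NUM_COLS = 4, 4

# (row change, column change) for actions 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def intended_move(state, action):
    """Return the state reached when the move happens exactly as chosen."""
    row, col = divmod(state, NUM_COLS)
    d_row, d_col = MOVES[action]
    # Clamp to the grid so that walking into a wall leaves the agent in place.
    row = min(max(row + d_row, 0), NUM_ROWS - 1)
    col = min(max(col + d_col, 0), NUM_COLS - 1)
    return row * NUM_COLS + col


def slippery_transitions(state, action):
    """Return (probability, next_state) pairs under the assumed slip rule."""
    outcomes = []
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        outcomes.append((1.0 / 3.0, intended_move(state, actual)))
    return outcomes


# Action 1 (down) in state 2 leads to states 1, 6, and 3.
print(slippery_transitions(2, 1))
```

The same function also reproduces the next states of the two `env.P` outputs discussed later in this section: `[4, 1, 0]` for state 0 with action 2, and `[2, 7, 3]` for state 3 with action 1.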


As we can notice, in the stochastic environment we reach the next states with some probability. Now, let's learn how to obtain this transition probability using the gym environment.  

We can obtain the transition probability and the reward function by typing `env.P[state][action]`. So, to obtain the transition probability of moving from the state S to the other states by performing the action right, we would type `env.P[S][right]`. But we cannot type the state S and the action right directly, since they are encoded as numbers. We learned that state S is encoded as 0 and the action right is encoded as 2, so we type `env.P[0][2]` as shown below:

In [6]:
print(env.P[0][2])

[(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]


What does this imply? Our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]`. It implies that if we perform action 2 (right) in state 0 (S), then:

* We reach the state 4 (F) with probability 0.33333 and receive 0 reward. 
* We reach the state 1 (F) with probability 0.33333 and receive 0 reward.
* We reach the same state 0 (S) with probability 0.33333 and receive 0 reward.

The transition probability is shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env8.png?raw=1'/>

Thus, when we type `env.P[state][action]`, we get the result in the form of `[(transition probability, next state, reward, Is terminal state?)]`. The last value is a Boolean that indicates whether the next state is a terminal state. Since 4, 1, and 0 are not terminal states, it is given as False.

The output of `env.P[0][2]` is shown in the below table for more clarity:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env9.png?raw=1'/>
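
Since each entry is an ordinary tuple, we can unpack it in a loop. The sketch below works on a copy of the `env.P[0][2]` output shown above rather than on a live environment, so it runs without Gym installed. The variable names are our own.

```python
# Output of env.P[0][2], copied from the cell above.
transitions = [
    (0.3333333333333333, 4, 0.0, False),
    (0.3333333333333333, 1, 0.0, False),
    (0.3333333333333333, 0, 0.0, False),
]

# Unpack each tuple into its four named parts.
for probability, next_state, reward, is_terminal in transitions:
    print(
        f"next state {next_state}: probability={probability:.5f}, "
        f"reward={reward}, terminal={is_terminal}"
    )

# The probabilities of all outcomes of one (state, action) pair sum to 1.
total_probability = sum(p for p, _, _, _ in transitions)

# Expected immediate reward of taking action 2 (right) in state 0 (S).
expected_reward = sum(p * r for p, _, r, _ in transitions)

print(round(total_probability, 5))  # 1.0
print(expected_reward)              # 0.0
```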

Let's understand this with one more example. Let's suppose we are in the state 3 (F) as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env10.png?raw=1'/>

Say we perform action 1 (down) in state 3 (F). Then the transition probability of state 3 (F) when performing action 1 (down) can be obtained as shown below:



In [7]:
print(env.P[3][1])

[(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 3, 0.0, False)]


As we learned, our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]`. It implies that if we perform action 1 (down) in state 3 (F), then:

* We reach the state 2 (F) with probability 0.33333 and receive 0 reward. 
* We reach the state 7 (H) with probability 0.33333 and receive 0 reward.
* We reach the same state 3 (F) with probability 0.33333 and receive 0 reward.


The transition probability is shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env11.png?raw=1'/>


The output of `env.P[3][1]` is shown in the below table for more clarity:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/frozen-lake-env12.png?raw=1'/>

As we can observe, the second row of our output is `(0.33333, 7, 0.0, True)`, and the last value here is marked as True. It implies that state 7 is a terminal state. That is, if we perform action 1 (down) in state 3 (F), then we reach state 7 (H) with probability 0.33333, and since 7 (H) is a hole, the agent dies if it reaches it. Thus, 7 (H) is a terminal state and so it is marked as True.
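
The terminal flag makes it easy to work out how risky an action is. As a small sketch, again using a copy of the `env.P[3][1]` output rather than a live environment, we can pick out the terminal next states and add up their probabilities. The variable names are our own.

```python
# Output of env.P[3][1], copied from the cell above.
transitions = [
    (0.3333333333333333, 2, 0.0, False),
    (0.3333333333333333, 7, 0.0, True),
    (0.3333333333333333, 3, 0.0, False),
]

# Next states whose "Is terminal state?" flag is True.
terminal_states = [state for _, state, _, done in transitions if done]

# Probability that this single action ends the episode.
terminal_probability = sum(p for p, _, _, done in transitions if done)

print(terminal_states)                 # [7]
print(round(terminal_probability, 5))  # 0.33333
```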

Thus, we learned how to obtain the state space, action space, transition probability, and reward function using the Gym environment. Next, we will learn how to generate an episode.

## Generating an episode

We learned that the agent-environment interaction starting from an initial state until the terminal state is called an episode.

Before we begin, we initialize the state by resetting our environment; resetting puts our agent back in the initial state.

We can reset our environment using the `reset()` function as follows:

In [8]:
state = env.reset()

### Action selection

In order for the agent to interact with the environment, it has to perform some
action in the environment. So, first, let's learn how to perform an action in the Gym environment. 

Let's suppose we are in state 0 (S), which is where the agent is placed after a reset.

Say we need to perform action 1 (down) and move to the new state 4 (F). How can we do that? We can perform an action using the step function. We just need to pass our action as a parameter to the step function.

So, we can perform action 1 (down) in state 0 (S) using the step function as follows:

In [12]:
env.step(1)

(4, 0.0, False, {'prob': 0.3333333333333333})

Now, let's render our environment using the render function.

In [13]:
env.render()

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG


Note that whenever we take an action using `env.step()`, it outputs a tuple containing 4 values. So, when we take action 1 (down) in state 0 (S) using `env.step(1)`, it gives the output as:

```
(4, 0.0, False, {'prob': 0.33333})
```

As you might have guessed, it implies that when we perform action 1 (down) in state 0 (S):

- We reach the next state 4 (F).
- The agent receives the reward 0.0.
- Since the next state 4 (F) is not a terminal state, it is marked as False.
- We reach the next state 4 (F) with a probability of 0.33333.

So, we can just store this information as:

In [14]:
(next_state, reward, done, info) = env.step(1)

In [15]:
print(next_state)  # next_state represents the next state

4


In [16]:
print(reward)  # reward represents the obtained reward

0.0


In [17]:
print(done)  # done implies whether our episode has ended

False


In [18]:
print(info)  # info holds extra information used for debugging

{'prob': 0.3333333333333333}
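
The four-value tuple shown above is what the Gym version used in this section returns. Gym 0.26 and later, as well as Gymnasium, return five values from `step`, splitting `done` into `terminated` and `truncated`. If you might run this code on either kind of release, a small helper can normalize the result. This is a sketch under that assumption, and `unpack_step` is a hypothetical name, not a Gym function.

```python
def unpack_step(result):
    """Normalize a step result to (next_state, reward, done, info).

    Accepts the 4-value tuple used in this section as well as the
    5-value tuple (with terminated and truncated) of newer releases.
    """
    if len(result) == 5:
        next_state, reward, terminated, truncated, info = result
        done = terminated or truncated
    else:
        next_state, reward, done, info = result
    return next_state, reward, done, info


# The 4-value output shown above passes through unchanged.
print(unpack_step((4, 0.0, False, {"prob": 0.3333333333333333})))

# A 5-value result is collapsed back to 4 values.
print(unpack_step((4, 0.0, False, False, {"prob": 0.3333333333333333})))
```

With this helper, a call such as `next_state, reward, done, info = unpack_step(env.step(1))` would work with both result shapes.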


We can also sample an action from our action space and perform a random action to explore our environment.

We can sample an action using the sample function:

In [19]:
random_action = env.action_space.sample()

After we have sampled an action from our action space, then we perform our
sampled action using our step function:

In [20]:
next_state, reward, done, info = env.step(random_action)

### Generating an episode

Now let's learn how to generate an episode. An episode is the agent-environment interaction starting from the initial state and ending at a terminal state. The agent interacts with the environment by performing some action in each state. An episode ends if the agent reaches a terminal state.

So, in the Frozen Lake environment, the episode will end if the agent reaches the terminal state, which is either the hole state (H) or goal state (G).

Let's understand how to generate an episode with the random policy. We learned that the random policy selects a random action in each state. So, at each time step of the episode, we take a random action, and the episode ends if the agent reaches a terminal state.

First, let's reset the environment, render the initial state, and set the maximum number of time steps:

In [22]:
state = env.reset()
print('Time Step 0 :')
env.render()
num_timesteps = 20

Time Step 0 :

[41mS[0mFFF
FHFH
FFFH
HFFG


In [23]:
for t in range(num_timesteps):  # For each time step
  # Randomly select an action by sampling from the action space
  random_action = env.action_space.sample()
  # Perform the selected action
  next_state, reward, done, info = env.step(random_action)
  print("Time Step {} :".format(t + 1))

  env.render()

  # If the next state is the terminal state, then break. This implies that our episode ends
  if done:
    break

Time Step 1 :
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 2 :
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 3 :
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
Time Step 4 :
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
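
To make the structure of this loop explicit without depending on an installed Gym version, here is a plain-Python sketch that records an episode as a list of transitions. It does not use Gym's implementation: `simulated_step` is a hypothetical stand-in that assumes the 4x4 map rendered above and a slip rule in which the agent moves in the chosen direction or one of the two perpendicular directions with equal probability.

```python
import random

LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
NUM_COLS = 4

# (row change, column change) for actions 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def simulated_step(state, action, rng):
    """Apply one slippery move and return (next_state, reward, done)."""
    # Assumed slip rule: chosen direction or one of its two perpendiculars.
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    row, col = divmod(state, NUM_COLS)
    d_row, d_col = MOVES[actual]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = LAKE_MAP[row][col]
    next_state = row * NUM_COLS + col
    # Reward is 1 only at the goal; holes and the goal end the episode.
    return next_state, float(tile == "G"), tile in "HG"


def generate_episode(num_timesteps=20, seed=0):
    """Run the random policy and return (state, action, reward, next_state) tuples."""
    rng = random.Random(seed)
    state = 0  # every episode starts in state 0 (S)
    trajectory = []
    for _ in range(num_timesteps):
        action = rng.randrange(4)  # random policy: pick any of the 4 actions
        next_state, reward, done = simulated_step(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


episode = generate_episode()
print(len(episode), episode[-1])
```

Storing the transitions in a list, rather than only rendering them, is a design choice that makes the episode available for later processing.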


Instead of generating one episode, we can also generate a series of episodes by taking some random action in each state:

In [24]:
num_episodes = 10
num_timesteps = 20

for i in range(num_episodes):
  state = env.reset()
  print("Time Step 0 :")
  env.render()

  for t in range(num_timesteps):
    random_action = env.action_space.sample()
    next_state, reward, done, info = env.step(random_action)
    print("Time Step {} :".format(t + 1))

    env.render()

    if done:
      break

Time Step 0 :

[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 1 :
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
Time Step 2 :
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 3 :
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
Time Step 4 :
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 5 :
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 6 :
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 7 :
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG
Time Step 8 :
  (Right)
SFFF
F[41mH[0mFH
FFFH
HFFG
Time Step 0 :

[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 1 :
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 2 :
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 3 :
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 4 :
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
Time Step 5 :
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
Time Step 6 :
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
Time Step 7 :
  (Up)
SFF[41mF[0m
FHFH
FFFH
HFFG
Time Step 8 :
  (Right)
SFF[41mF[0m
FHFH
FFFH
HFFG
Time Step 9 :
  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
Time Step 0 :

[41mS[0mFFF
F

## Conclusions

Thus, we can generate an episode by selecting a random action in each state by
sampling from the action space. 

But wait! What is the use of this? Why do we even need to generate an episode?

We learned that an agent can find the optimal policy (that is, the correct action in each state) by generating several episodes. But in the preceding example, we just took random actions in each state over all the episodes.
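
To get a feel for how the random policy performs, we can count how often it reaches the goal over many episodes. The following is a rough sketch that uses a plain-Python stand-in for the environment rather than Gym itself. It assumes the 4x4 map rendered earlier, a 20-step limit per episode, and a slip rule in which the agent moves in the chosen direction or one of its two perpendicular directions with equal probability. All names are hypothetical.

```python
import random

# The 4x4 grid flattened row by row, so the index equals the state number.
LAKE = "".join(["SFFF", "FHFH", "FFFH", "HFFG"])

# (row change, column change) for actions 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def random_episode_reaches_goal(rng, num_timesteps=20):
    """Play one episode with random actions; return True if it ends in G."""
    state = 0
    for _ in range(num_timesteps):
        action = rng.randrange(4)
        # Assumed slip rule: chosen direction or one of its two perpendiculars.
        actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
        row, col = divmod(state, 4)
        row = min(max(row + MOVES[actual][0], 0), 3)
        col = min(max(col + MOVES[actual][1], 0), 3)
        state = row * 4 + col
        if LAKE[state] in "HG":
            return LAKE[state] == "G"
    return False  # ran out of time steps without reaching a terminal state


def success_rate(num_episodes=10000, seed=0):
    """Return the fraction of random-policy episodes that reach the goal."""
    rng = random.Random(seed)
    wins = sum(random_episode_reaches_goal(rng) for _ in range(num_episodes))
    return wins / num_episodes


print(success_rate())
```

The printed fraction gives a baseline that a learned policy can later be compared against.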

How can the agent find the optimal policy? So, in the case of the Frozen Lake environment, how can the agent find the optimal policy that tells the agent to reach state G from state S without visiting the hole states H?

**This is where we need a reinforcement learning algorithm. Reinforcement learning is all about finding the optimal policy, that is, the policy that tells us what action to perform in each state.**

So far we have understood how the Gym environment works using the basic Frozen
Lake environment, but Gym has so many other functionalities and also several
interesting environments.