# Chapter 1: Running an environment

Let's make the robot learn how to move around this simple loop maze.<br>

<table style="float:left;background: #407EAF">
<tr>
<th>
<p class="transparent">Execute in WebShell #1</p>
</th>
</tr>
</table>

In [None]:
rosrun gym_construct simple_qlearn.py

This runs a simple learning session of 10 episodes. The robot is rewarded more for going forward than for turning, and an episode ends when its time runs out or the robot gets too close to a wall (the distance to the nearest wall is less than 0.2 meters). You should see something similar to this:

<img src="img/turtlebotlearning_example.gif" />

So... how does this work? Let's have a look at the script you have just executed.

<table style="float:left;background: #407EAF">
<tr>
<th>
<p class="transparent">Execute in WebShell #1</p>
</th>
</tr>
</table>

In [None]:
rosed gym_construct simple_qlearn.py

This command will open the specified file **<i>simple_qlearn.py</i>** using the <a href="http://www.vim.org" target="_blank">**vim**</a> editor, which is a text editor for terminals.

<span style="color:green;">**NOTE:** Enter **:q** to get out of the vim editor</span>

You can find this script inside the **scripts** folder of the **gym_construct** package.

You will see the following file:

<p style="background:green;color:white;">**simple_qlearn.py**</p>

In [1]:
#!/usr/bin/env python
import gym
import gym_gazebo
import time
import numpy
import random
import matplotlib
import matplotlib.pyplot as plt
import qlearn
# import liveplot
from gym import wrappers
from liveplot import LivePlot 

def render():
    render_skip = 0 #Skip first X episodes.
    render_interval = 50 #Show render Every Y episode.
    render_episodes = 10 #Show Z episodes every rendering.

    if (x%render_interval == 0) and (x != 0) and (x > render_skip):
        env.render()
    elif ((x-render_episodes)%render_interval == 0) and (x != 0) and (x > render_skip) and (render_episodes < x):
        env.render(close=True)

if __name__ == '__main__':

    env = gym.make('GazeboCircuitTurtlebotLidar-v0')
    print "Gym Make done"
    outdir = '/tmp/gazebo_gym_experiments'
    #outdir = '/home/user/catkin_ws/src/gym_construct/src/gazebo_gym_experiments'
    # env.monitor.start(outdir, force=True, seed=None)       # I had to comment this and
    env = wrappers.Monitor(env, outdir, force=True)          # use this to avoid warnings
    #plotter = LivePlot(outdir)
    print "Monitor Wrapper started"
    last_time_steps = numpy.ndarray(0)

    qlearn = qlearn.QLearn(actions=range(env.action_space.n),
                    alpha=0.1, gamma=0.8, epsilon=0.9)

    initial_epsilon = qlearn.epsilon

    epsilon_discount = 0.999 # about 2197 episodes for epsilon to decay from 0.9 to 0.1

    start_time = time.time()
    total_episodes = 10
    highest_reward = 0

    for x in range(total_episodes):
        done = False

        cumulated_reward = 0 #Should going forward give more reward than L/R ?
        print ("Episode = "+str(x))
        observation = env.reset()
        if qlearn.epsilon > 0.05:
            qlearn.epsilon *= epsilon_discount

        #render()
        env.render()

        state = ''.join(map(str, observation))

        for i in range(500):

            # Pick an action based on the current state
            action = qlearn.chooseAction(state)
            #print ("Action Chosen"+str(action))
            # Execute the action and get feedback
            observation, reward, done, info = env.step(action)
            cumulated_reward += reward
            #print ("Reward="+str(reward))
            if highest_reward < cumulated_reward:
                highest_reward = cumulated_reward

            nextState = ''.join(map(str, observation))

            qlearn.learn(state, action, reward, nextState)

            #env.monitor.flush(force=True)

            if not(done):
                #print "NOT done"
                state = nextState
            else:
                print "DONE"
                last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
                break 

        m, s = divmod(int(time.time() - start_time), 60)
        h, m = divmod(m, 60)
        print ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+"     Time: %d:%02d:%02d" % (h, m, s))

    #Github table content
    print ("\n|"+str(total_episodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |")

    l = last_time_steps.tolist()
    l.sort()

    #print("Parameters: a="+str)
    print("Overall score: {:0.2f}".format(last_time_steps.mean()))
    print("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))

    #env.monitor.close()
    #env.close()



<p style="background:green;color:white;">**simple_qlearn.py**</p>

Let's concentrate on the OpenAI-Gym basic infrastructure, leaving the AI algorithm operation out of the scope of this course.

### Select the environment

The first step is to set up the Gym environment, which the script does with **<i>gym.make('GazeboCircuitTurtlebotLidar-v0')</i>**.

This is done by passing the ID of the environment to use. The call loads everything specified in the environment configuration file, such as the available actions, the rewards that will be given, and the basic communication with the Gazebo and ROS infrastructure. As you can see, the learning loop itself contains no ROS or Gazebo calls. The environment could be a video game, a different simulator, or even a real robot: as long as the interface is the same, the same program can be used by changing only the environment.
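
To illustrate that last point, here is a minimal sketch of a loop that depends only on the **reset()** and **step()** calls. `ToyCorridorEnv`, `run_episode`, and `always_forward` are hypothetical names made up for this illustration. They are not part of gym_construct, and the toy class only imitates the method signatures that simple_qlearn.py uses.

```python
class ToyCorridorEnv(object):
    """Hypothetical stand-in for a Gym environment (not part of gym_construct).

    The agent starts in cell 0 of a short corridor. Action 0 moves one cell
    forward, any other action stays in place. The episode is done when the
    last cell is reached.
    """

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return [self.position]

    def step(self, action):
        if action == 0:
            self.position += 1
        reward = 1 if action == 0 else 0
        done = self.position >= self.length - 1
        return [self.position], reward, done, {}


def run_episode(env, policy, max_steps=500):
    """Run one episode using only reset() and step(), as the script does."""
    observation = env.reset()
    total_reward = 0
    for step_index in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            return total_reward, step_index + 1
    return total_reward, max_steps


def always_forward(observation):
    """Trivial policy: ignore the observation and always pick action 0."""
    return 0


total, steps = run_episode(ToyCorridorEnv(length=5), always_forward)
print("reward=%d steps=%d" % (total, steps))
```

Because `run_episode` never looks inside the environment, any object with compatible **reset()** and **step()** methods could be passed to it without changing the loop.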

### Time Steps 

The inner loop of the script, **<i>for i in range(500)</i>**, sets the maximum number of time steps taken in an episode before moving on to the next one. Keep this number in mind because, in the environment setup, you will have to set the same number or higher to avoid errors from the gym module.

### Start up the monitoring 

In AI learning, it's essential to record the results so that you can evaluate whether the robot is really learning or not. Gym provides a wrapper for recording data from the run, which the script applies with **<i>wrappers.Monitor(env, outdir, force=True)</i>** and which writes to the folder given as **outdir**. The data that the learning algorithm receives from the system are called <b>observations</b>. The observations are what we consider relevant data for learning and for deciding which actions should be performed next.
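
One simple check is the summary that simple_qlearn.py prints at the end of a run: the mean number of steps per finished episode and the mean of the longest ones. The sketch below reproduces that calculation in plain Python. The function name `summarize_episode_lengths` and the sample numbers are made up for illustration, and the guard for an empty list is an addition that the original script does not have.

```python
def summarize_episode_lengths(episode_lengths, best_n=100):
    """Return (mean of all lengths, mean of the best_n longest lengths).

    Returns (None, None) when no episode finished, instead of failing on
    empty data.
    """
    if not episode_lengths:
        return None, None
    ordered = sorted(episode_lengths)
    best = ordered[-best_n:]
    overall_mean = sum(ordered) / float(len(ordered))
    best_mean = sum(best) / float(len(best))
    return overall_mean, best_mean


# Made-up episode lengths (steps before each episode ended).
lengths = [12, 30, 18, 45, 60]
overall, best_two = summarize_episode_lengths(lengths, best_n=2)
print("Overall score: {:0.2f}".format(overall))
print("Best 2 score: {:0.2f}".format(best_two))
```

If the robot is learning to avoid walls, episodes should last longer over time, so both numbers should tend to grow as you train for more episodes.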

### Take a learning step

In between all the qlearn algorithm related code, concentrate on the **<i>env.step()</i>** method. It returns:
<ul>
<li><i>**observation**</i>: The observation is the state of the environment. The kind of data it contains depends on how the environment set-up file is defined. In this case, it is a discretized version of the robot's laser readings. In your own environments, it should be whatever data your AI algorithm needs in order to decide on the next action: altitude, image data, sonar data, point clouds, tactile data, etc.</li>
<li><i>**reward**</i>: It's the reward for the current step taken. The higher the reward, the better the robot is performing, based on the conditions you stated.</li>
<li><i>**done**</i>: It states whether the episode is done or not. In this case, it will be <b>done = True</b> if the robot has gotten too close to a wall (less than 0.2 meters according to the laser sensor readings).</li>
<li><i>**info**</i>: Extra information. In this case, it's empty.</li>
</ul>
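
The script turns each observation into a dictionary-friendly state with `''.join(map(str, observation))`. The short sketch below shows that conversion on a made-up return value. The numbers are invented for illustration and the helper name `observation_to_state` is hypothetical; only the join expression comes from simple_qlearn.py.

```python
def observation_to_state(observation):
    """Join the discretized readings into one string, as simple_qlearn.py does."""
    return ''.join(map(str, observation))


# Made-up values standing in for what env.step(action) returns.
observation, reward, done, info = [2, 0, 1, 1, 2], 5, False, {}

next_state = observation_to_state(observation)
print(next_state)  # prints 20112
```

Note that plain concatenation is only unambiguous while every reading is a single digit: `[1, 12]` and `[11, 2]` would both become `'112'`.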

<p style="background:#EE9023;color:white;">**Exercise 1.1**</p>

a) First of all, create a local copy (in your Ignite workspace) of the **simple_qlearn.py** learning script. You can name it **my_simple_learning_turtlebot.py**. Also copy the required Python modules, **qlearn.py** and **liveplot.py**, to your workspace. This way, you will have your own copy of the script and can modify it freely without affecting the original.

<table style="float:left;background: #407EAF">
<tr>
<th>
<p class="transparent">Execute in WebShell #1</p>
</th>
</tr>
</table>

In [None]:
roscd gym_construct/scripts
cp simple_qlearn.py /home/user/catkin_ws/src/my_simple_learning_turtlebot.py
cp qlearn.py /home/user/catkin_ws/src/
cp liveplot.py /home/user/catkin_ws/src/

At the end, you should have these three new files in your workspace:

<img src="img/openai_files_copied.png" width="300" />

b) Try changing the number of **episodes**, and check whether the robot performs better when it is given more time to learn.
Keep in mind that it could easily need 2000 episodes to perform properly. We recommend you start with just **10** or so to see how it behaves.
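
A rough way to see why so many episodes are needed is to look at how slowly the exploration rate decays. The sketch below uses the values from simple_qlearn.py (epsilon starts at 0.9 and is multiplied by 0.999 once per episode) and assumes that epsilon is the probability of taking a random action, as is usual for epsilon-greedy exploration; qlearn.py itself is not shown in this chapter. The target of 0.1 and the helper name `epsilon_after` are chosen only for illustration.

```python
import math

initial_epsilon = 0.9     # epsilon passed to QLearn in simple_qlearn.py
epsilon_discount = 0.999  # factor applied once per episode
target_epsilon = 0.1      # illustrative target, not a value the script checks


def epsilon_after(episodes):
    """Epsilon after the given number of per-episode discounts."""
    return initial_epsilon * epsilon_discount ** episodes


# Solve initial_epsilon * epsilon_discount**n <= target_epsilon for n.
episodes_needed = int(math.ceil(
    math.log(target_epsilon / initial_epsilon) / math.log(epsilon_discount)))

print("epsilon after 10 episodes: %.3f" % epsilon_after(10))
print("episodes to reach %.1f: %d" % (target_epsilon, episodes_needed))
```

After 10 episodes epsilon is still about 0.891, so under that assumption the robot is acting almost entirely at random. It takes roughly 2200 episodes for epsilon to fall to 0.1.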

c) Implement a new script that uses a different learning strategy, for instance **SARSA**. There is an almost identical example in the **circuit2_turtlebot_lidar_sarsa.py** file, inside **gym_construct/scripts**. Just change the environment to the current one, and remember to copy the sarsa Python module as well.

d) Implement your own algorithm based on the data returned by the **env.step()** method.
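
As a possible starting point, here is a minimal sketch of a tabular Q-learning agent that consumes the state, action, and reward values discussed above. `TinyQLearn` is a hypothetical class written for this illustration and is NOT the course's qlearn.py, whose internals are not shown here. The constructor arguments mirror the ones simple_qlearn.py passes to `QLearn`, and the three actions in the usage lines are an arbitrary choice.

```python
import random


class TinyQLearn(object):
    """Hypothetical tabular Q-learning agent; NOT the course's qlearn.py."""

    def __init__(self, actions, alpha, gamma, epsilon):
        self.q = {}                  # maps (state, action) -> estimated value
        self.actions = list(actions)
        self.alpha = alpha           # learning rate
        self.gamma = gamma           # discount factor for future rewards
        self.epsilon = epsilon       # probability of picking a random action

    def get_q(self, state, action):
        """Unseen (state, action) pairs default to a value of 0.0."""
        return self.q.get((state, action), 0.0)

    def choose_action(self, state):
        """Epsilon-greedy choice, breaking ties randomly among best actions."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        values = [self.get_q(state, a) for a in self.actions]
        best = max(values)
        candidates = [a for a, v in zip(self.actions, values) if v == best]
        return random.choice(candidates)

    def learn(self, state, action, reward, next_state):
        """Move Q(state, action) toward reward + gamma * max Q(next_state, .)."""
        best_next = max(self.get_q(next_state, a) for a in self.actions)
        old_value = self.get_q(state, action)
        target = reward + self.gamma * best_next
        self.q[(state, action)] = old_value + self.alpha * (target - old_value)


# Epsilon is set to 0.0 here so that the choice below is purely greedy.
agent = TinyQLearn(actions=range(3), alpha=0.1, gamma=0.8, epsilon=0.0)
agent.learn('20112', 0, 5, '20111')
print(agent.get_q('20112', 0))
print(agent.choose_action('20112'))
```

In a training loop, `choose_action` would be called before each **env.step()** and `learn` right after it, using the state strings built from the observations.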

<span style="color:green;">**NOTE:** If, for any reason, your Python script stops working, you might have to **kill its process** in order to stop it, because the **CTRL+C** command may not be enough to stop it. This is because of how these Gym Python scripts are constructed. In order to stop the process, you can do the following:</span>

<table style="float:left;background: #407EAF">
<tr>
<th>
<p class="transparent">Execute in WebShell #1</p>
</th>
</tr>
</table>

In [None]:
ps faux | grep python
kill -9 id__your_script_process

<p style="background:#EE9023;color:white;">**End of Exercise 1.1**</p>