# ABE tutorial 1
## Setting up an ABE workshop

In this first tutorial, let's set up a workshop to build agents, environments, and RL algorithms!

Steps:
* Install tianshou
* Check that it works
* Explore available algorithms and environments


## Tianshou

Tianshou is a Python library that makes working with deep reinforcement learning easier. Its focus is on developing implementations of reinforcement learning algorithms that can interact with a wide range of environments. You can read more in the documentation here: https://tianshou.org

There are some good tutorials on how to use the different modules of tianshou here: https://tianshou.org/en/stable/02_notebooks/L0_overview.html

Below we'll go over most of what those tutorials cover, but with a focus on the material in the ABE book.


### 1. Create a virtual environment

Creating a virtual environment is an easy way to make sure you can install all the right python libraries and versions for a specific project. First navigate to where you'd like to have your project and create a project folder.

Open up a terminal (command line) and navigate to your folder. Once inside the folder we can create the virtual environment. 

First we have to make sure the right Python version is installed. Tianshou requires Python 3.11 or higher.

You can check which python you are using with the command:

```which python```
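`which python` shows where the interpreter lives, but not which version it is. Here is a minimal sketch of a version check run from Python itself. The helper name `meets_minimum` is made up for illustration, and the `(3, 11)` minimum comes from the requirement stated above.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 11)):
    """Return True if the given Python version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


print("Python", sys.version.split()[0])
print("New enough for this tutorial:", meets_minimum())
```

If the second line prints `False`, install a newer Python before creating the virtual environment.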

On Ubuntu you can install a newer version of python via:

```
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.11 python3.11-venv
```

Create the virtual environment by running the following:
```
python3.11 -m venv my_venv
```
Or, if your default `python3` is already version 3.11 or greater, create the virtual environment with:

```
python3 -m venv my_venv
```

Here "my_venv" is the name of your environment; feel free to change it.
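The same thing can be done from Python with the standard library's `venv` module, which is what `python3 -m venv` runs under the hood. This is only a sketch: the helper `create_env` is a made-up name, and it builds the environment in a temporary folder without pip so that nothing in your real project is touched.

```python
import tempfile
import venv
from pathlib import Path


def create_env(project_dir, name="my_venv"):
    """Create a virtual environment called `name` inside `project_dir`."""
    env_dir = Path(project_dir) / name
    # with_pip=False keeps this sketch fast; `python3 -m venv` installs pip by default
    venv.create(env_dir, with_pip=False)
    return env_dir


# Try it in a throwaway folder that is deleted again afterwards.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    print("Created:", env_dir.name)
    print("Has pyvenv.cfg:", (env_dir / "pyvenv.cfg").exists())
```

For the tutorial itself, the terminal command above is all you need.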

You now have a virtual environment!

### 2. Install python libraries

One of the main benefits of a virtual environment is that you can install all the libraries you need within it, without messing up any other Python projects.

The first step is to activate your virtual environment. This ensures that any libraries you install will be added to that python environment and not some other environment.

Activate your virtual environment by typing the following into your terminal while inside your project folder (linux/mac):

```
source my_venv/bin/activate
```

or on Windows:
```
my_venv\Scripts\activate
```

You should now see (my_venv) at the start of your terminal prompt. Any libraries we install now will go inside this virtual environment!
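If you are ever unsure whether the environment is active, Python can tell you. A minimal sketch, relying on the fact that inside a virtual environment `sys.prefix` differs from `sys.base_prefix`; the function name `in_virtualenv` is just for illustration.

```python
import sys


def in_virtualenv():
    """Return True when Python is running from inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtualenv())
print("Python prefix:", sys.prefix)
```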

Let's install the libraries we'll need. Thankfully it's quite straightforward:

```pip install tianshou```

It will take a little time to install tianshou and all the libraries it depends on (e.g., torch and numpy).


### 3. Check that it works!

Here we'll do a quick check that the installation worked: import tianshou and print its version.

In [117]:
import tianshou
print(tianshou.__version__)

1.1.0
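Installing tianshou also pulls in several other libraries. Here is a small sketch, using only the standard library's `importlib.metadata`, that lists which versions ended up in your environment. The helper name `installed_versions` and the choice of packages to check are illustrative.

```python
from importlib import metadata


def installed_versions(packages=("tianshou", "torch", "gymnasium", "numpy")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Anything reported as "not installed" usually means the virtual environment was not active when you ran `pip install`.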


### 4. Train our first RL agent

For this first run we'll focus just on the pieces that need to be in place for us to train an RL agent. Later, as we move through the tutorials, we'll learn more about each of the pieces and even start to customize some of them!

But for now let's use existing RL algorithms and some existing environments to see how it all works together.

To start off, let's create a new Python file called first_RL.py.

Import some libraries:

* **gymnasium** will have some environments for us to use (https://gymnasium.farama.org/)
* **torch** will let us build some neural networks (https://pytorch.org/)
* **Tensorboard** will let us see how well our agent is doing
* **tianshou** provides the RL algorithms and the training tools


In [2]:
import gymnasium as gym
import torch
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts


Next, let's start a little logger so we can see what is going on. The code below will write summary statistics about our agent to the directory log/dqn.

In [3]:
#start a logger
logger = ts.utils.TensorboardLogger(SummaryWriter('log/dqn'))

Now let's choose an environment to train our agent in. We'll see how to work with environments and even how to make our own later on.

In [4]:
# Create an environment: render_mode="human" means we'd like to see the environment in a window.
env = gym.make("CartPole-v1", render_mode="human")


Let's take a closer look at this environment. To do this, let's reset the environment, then see what the agent can "see" and what actions the agent can take.

In [5]:
#start the environment at the "start"
env.reset()

#take a look at the environment
env.render()

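In [122]:
#what can the agent see? (observation_space is an attribute, so no parentheses)
#a Box of 4 numbers: cart position, cart velocity, pole angle, pole angular velocity
print(env.observation_space)

In [None]:
#what can the agent do?
#Discrete(2): action 0 pushes the cart left, action 1 pushes it right
print(env.action_space)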

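Let's try out some actions ourselves! Let's take 10 steps to the right.

In [7]:
for i in range(10):
    #action 1 pushes the cart to the right; pass a plain integer, not a list
    obs, reward, terminated, truncated, info = env.step(1)
    env.render()
    if terminated or truncated:
        env.reset()  #the pole fell over, so start again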

Try taking more actions, can you stabilize the pole?
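Each call to `env.step` returns five values: the new observation, the reward, whether the episode terminated, whether it was truncated (ran out of time), and an info dictionary. The sketch below runs the same reset/step loop on a made-up toy environment, so you can try the pattern without opening a window. `ToyWalk` is hypothetical and not part of gymnasium; it only mimics the shape of the interface.

```python
import random


class ToyWalk:
    """Hypothetical toy environment that mimics the gymnasium reset/step interface.

    The agent stands on a number line and the episode ends when it
    wanders more than `limit` positions away from the centre.
    """

    def __init__(self, limit=3, max_steps=50):
        self.limit = limit
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        # action 0 moves left, action 1 moves right; a random gust adds noise
        self.position += (1 if action == 1 else -1) + self.rng.choice([-1, 0, 1])
        self.steps += 1
        terminated = abs(self.position) > self.limit  # "fell over"
        truncated = self.steps >= self.max_steps      # ran out of time
        return self.position, 1.0, terminated, truncated, {}


toy_env = ToyWalk()
obs, info = toy_env.reset(seed=0)
total_reward = 0.0
while True:
    action = 0 if obs > 0 else 1  # simple rule: push back towards the centre
    obs, reward, terminated, truncated, info = toy_env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
print("Episode finished with total reward", total_reward)
```

The hand-written rule `0 if obs > 0 else 1` plays the role of a policy. Training an agent means learning such a rule from experience instead of writing it by hand.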

Ok, let's train an agent to do it!

Let's start building our agent. 

To start off, let's build a neural network that takes what the agent observes and converts it to actions.

In [101]:
#import a network that we can use
from tianshou.utils.net.common import Net

#What the agent 'sees'
state_shape = env.observation_space.shape or env.observation_space.n

#what actions the agent can take
action_shape = env.action_space.shape or env.action_space.n

#build a network that takes observations and converts it to actions
net = Net(state_shape=state_shape, action_shape=action_shape, hidden_sizes=[128, 128, 128])
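To get a feel for the size of this network, we can count its trainable parameters by hand. This sketch assumes that `Net` builds a plain fully connected network with one linear layer (weights plus bias) between each pair of consecutive sizes. That is an assumption about its defaults, so treat the number as an estimate. The function `count_mlp_parameters` is made up for illustration.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# CartPole: 4 observation values in, 2 action values out, three hidden layers of 128
sizes = [4, 128, 128, 128, 2]
print(count_mlp_parameters(sizes))  # 33922
```

If you have the real network at hand, `sum(p.numel() for p in net.parameters())` gives the actual count to compare against.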


Now we'll need to get our agent to learn!

In [102]:
#the optimizer adjusts the network's weights so that it predicts action values better
optim = torch.optim.Adam(net.parameters(), lr=0.001)

In [103]:
policy = ts.policy.DQNPolicy(
    model=net,
    optim=optim,
    discount_factor=0.9,
    action_space=env.action_space,
    estimation_step=3,
    target_update_freq=320
)
train_collector = ts.data.Collector(policy, env, ts.data.VectorReplayBuffer(20000, 1), exploration_noise=True)
test_collector = ts.data.Collector(policy, env, exploration_noise=True)  # because DQN uses epsilon-greedy method (chooses best action, with some noise epsilon)
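The epsilon-greedy idea mentioned in the comment above is simple enough to write out. This is a minimal pure-Python sketch of the technique and not Tianshou's implementation. The function name `epsilon_greedy` and the example values are made up for illustration.

```python
import random


def epsilon_greedy(q_values, eps, rng=random):
    """Pick the best action most of the time, and a random one with probability eps."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore: any action, chosen uniformly
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best action


rng = random.Random(0)
q_values = [0.2, 1.5]  # made-up value estimates for "left" and "right"
picks = [epsilon_greedy(q_values, eps=0.1, rng=rng) for _ in range(1000)]
print("fraction of greedy picks:", picks.count(1) / len(picks))
```

With `eps=0.1` the agent explores on roughly one step in ten. Because a random pick can also land on the best action, the greedy action is chosen somewhat more often than 90% of the time.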


In [104]:
result = ts.trainer.OffpolicyTrainer(
    policy=policy,
    train_collector=train_collector,
    test_collector=test_collector,
    max_epoch=10,
    step_per_epoch=10000,
    step_per_collect=10,
    episode_per_test=100,
    batch_size=64,
    update_per_step=1 / 10,
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),
    stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold,
    logger=logger,
).run()
print(f"Finished training in {result.timing.total_time} seconds")

Epoch #1: 10001it [03:38, 45.71it/s, env_step=10000, gradient_step=1000, len=154, n/ep=0, n/st=10, rew=154.00]                           


Epoch #1: test_reward: 101.480000 ± 8.243155, best_reward: 101.480000 ± 8.243155 in #1


Epoch #2: 10001it [04:04, 40.90it/s, env_step=20000, gradient_step=2000, len=157, n/ep=0, n/st=10, rew=157.00]                           


Epoch #2: test_reward: 193.360000 ± 34.451856, best_reward: 193.360000 ± 34.451856 in #2


Epoch #3: 10001it [09:31, 17.50it/s, env_step=30000, gradient_step=3000, len=185, n/ep=0, n/st=10, rew=185.00]                             


Epoch #3: test_reward: 195.190000 ± 58.612745, best_reward: 195.190000 ± 58.612745 in #3


Epoch #4: 10001it [03:38, 45.86it/s, env_step=40000, gradient_step=4000, len=155, n/ep=0, n/st=10, rew=155.00]                           


Epoch #4: test_reward: 133.250000 ± 7.810730, best_reward: 195.190000 ± 58.612745 in #3


Epoch #5: 10001it [03:38, 45.86it/s, env_step=50000, gradient_step=5000, len=138, n/ep=0, n/st=10, rew=138.00]                           


Epoch #5: test_reward: 165.330000 ± 10.719193, best_reward: 195.190000 ± 58.612745 in #3


Epoch #6: 10001it [03:37, 45.89it/s, env_step=60000, gradient_step=6000, len=170, n/ep=0, n/st=10, rew=170.00]                           


Epoch #6: test_reward: 135.570000 ± 7.498340, best_reward: 195.190000 ± 58.612745 in #3


Epoch #7: 10001it [03:37, 45.88it/s, env_step=70000, gradient_step=7000, len=170, n/ep=0, n/st=10, rew=170.00]                           


Epoch #7: test_reward: 163.600000 ± 12.404032, best_reward: 195.190000 ± 58.612745 in #3


Epoch #8: 10001it [03:37, 45.90it/s, env_step=80000, gradient_step=8000, len=204, n/ep=0, n/st=10, rew=204.00]                           


Epoch #8: test_reward: 237.560000 ± 27.927162, best_reward: 237.560000 ± 27.927162 in #8


Epoch #9: 10001it [17:20,  9.62it/s, env_step=90000, gradient_step=9000, len=151, n/ep=0, n/st=10, rew=151.00]                             


Epoch #9: test_reward: 144.340000 ± 8.509078, best_reward: 237.560000 ± 27.927162 in #8


Epoch #10: 10001it [03:38, 45.84it/s, env_step=100000, gradient_step=10000, len=20, n/ep=0, n/st=10, rew=20.00]                           


Epoch #10: test_reward: 87.760000 ± 26.133549, best_reward: 237.560000 ± 27.927162 in #8
Finished training in 6791.983935117722 seconds
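The progress bars above report `gradient_step=1000` after the first epoch. Here is a quick sketch of where that number comes from. It assumes the trainer performs `update_per_step` network updates for every environment step it collects, so `step_per_collect * update_per_step` updates after each round of collection. The helper function is made up for illustration.

```python
def gradient_steps_per_epoch(step_per_epoch, step_per_collect, update_per_step):
    """Estimate how many network updates happen in one epoch."""
    collects = step_per_epoch // step_per_collect  # collection rounds per epoch
    updates_per_collect = round(step_per_collect * update_per_step)
    return collects * updates_per_collect


per_epoch = gradient_steps_per_epoch(10000, 10, 1 / 10)
print(per_epoch)       # 1000, matching gradient_step=1000 after epoch 1
print(per_epoch * 10)  # 10000, matching gradient_step=10000 after epoch 10
```

So with these settings the network is updated once for every 10 steps the agent takes in the environment.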


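To see how the training went, launch TensorBoard from a terminal in your project folder and point it at the log directory written by the logger above, e.g. `tensorboard --logdir log/dqn`.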

Save the trained agent

In [105]:
torch.save(policy.state_dict(), 'dqn.pth')
policy.load_state_dict(torch.load('dqn.pth'))


<All keys matched successfully>

Now let's watch the trained agent in the environment. To do so you might have to install pygame:

```pip install pygame```

In [108]:

policy.eval()
policy.set_eps(0.05)
collector = ts.data.Collector(policy, env, exploration_noise=True)
collector.collect(n_episode=1, render=1 / 35, reset_before_collect=True)

CollectStats(n_collected_episodes=1, n_collected_steps=98, collect_time=5.224884033203125, collect_speed=18.756397152018877, returns=array([98.]), returns_stat=SequenceSummaryStats(mean=98.0, std=0.0, max=98.0, min=98.0), lens=array([98]), lens_stat=SequenceSummaryStats(mean=98.0, std=0.0, max=98.0, min=98.0))

We can also run the trained policy step by step ourselves instead of using a collector:

In [91]:
#create a new (vectorized) environment
env = ts.env.SubprocVectorEnv([lambda: gym.make("CartPole-v1", render_mode="human") for _ in range(1)])

#setup the policy
policy.eval()
policy.set_eps(0.05) #this gives it some randomness in choosing actions

#get the initial observations (reset returns both the observations and an info array)
obs, info = env.reset()

#run a loop!
for i in range(100):
    #wrap what the agent sees in a Batch, which is what the policy expects
    batch = ts.data.Batch(obs=obs, info=info)
    with torch.no_grad():
        act = policy(batch).act
    #apply the epsilon-greedy randomness to the chosen actions
    act = policy.exploration_noise(act, batch)
    obs, rew, terminated, truncated, info = env.step(act)
    #start a new episode when the current one ends
    if terminated[0] or truncated[0]:
        obs, info = env.reset()

env.close()



You should see output that shows how well a PPO agent is doing in the CartPole environment. You can stop the training by pressing Ctrl-C (Ctrl-Z only suspends the process rather than stopping it).

To see more about this environment you can check out: https://gymnasium.farama.org/environments/classic_control/cart_pole/ 

If this worked you can even try out another RL algorithm on the same environment:

```python -m cleanrl.dqn --env CartPole-v1```

Again, you can press Ctrl-C to stop the training.

Next, let's look at how to work with cleanrl, e.g., how to view and modify the code, and how to visualize the training!

### 5. Working with cleanrl

To work with cleanrl you can use any editor, but we're going to show instructions for VSCode.

Open VSCode and open your project folder. You should see your virtual environment folder inside, along with a runs folder. The runs folder contains all the runs you tested out in the steps before. Let's see those in a little more detail using tensorboard.

In VSCode install the tensorboard extension. Click on extensions (left sidebar), search for tensorboard, and click install.

Now that we have it installed, and the folder we are in has a runs folder, we can run tensorboard and it will display all the runs in the runs folder. To run tensorboard open the command palette (ctrl-shift-p or view-->command palette), type in tensorboard, and click on launch tensorboard.

It should show various graphs about how the training went; we'll get to know them better in subsequent tutorials. Do you see any differences between the RL algorithms you ran in the previous step?

Finally, let's see some of the code behind these algorithms. To see the PPO code, go back to files (left sidebar) and click on your virtual environment folder. Inside you should see a cleanrl folder, and inside that the code for each of the algorithms. Click on PPO.py. This file contains all the steps to run and train a PPO agent. We'll get to know all the steps in this file! But for now it's good enough to know where the code is and what it looks like!



### 6. Trying out some more examples

If you'd like to play around a little more with agents and environments, check out the environments available at:

https://gymnasium.farama.org

Scroll down to the list of environments and try running the PPO agent on some of them:


```python -m cleanrl.ppo --env Environment-name-here```

e.g.,
```python -m cleanrl.ppo --env Acrobot-v1```
 