# ABE tutorial 1
## Setting up an ABE workshop

In this first tutorial, let's set up a workshop to build agents, environments, and RL algorithms!

Steps:
* Install tianshou
* Check that it works
* Explore available algorithms and environments


## Tianshou

Tianshou is a python library that makes working with deep reinforcement learning easier. Its focus is on developing implementations of reinforcement learning algorithms that can interact with a wide range of environments. You can read more in the documentation here: https://tianshou.org

There are some good tutorials on how to use the different modules of tianshou here: https://tianshou.org/en/stable/02_notebooks/L0_overview.html

Below we'll cover most of the material from those tutorials, but with a focus on what we cover in the ABE book.


### 1. Create a virtual environment

Creating a virtual environment is an easy way to make sure you can install all the right python libraries and versions for a specific project. First navigate to where you'd like to have your project and create a project folder.

Open up a terminal (command line) and navigate to your folder. Once inside the folder we can create the virtual environment. 

**Check if you have python**

First we have to make sure the right python distribution is installed. Tianshou requires python 3.11 or higher.

You can check where the python you are using is installed with the command:

Mac/Linux
```which python```

Windows
```where python```
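
The commands above show where python is installed, not which version it is. As a small sketch, the snippet below checks the version from inside python itself. The helper name `meets_minimum` is made up for this tutorial, and the 3.11 minimum is simply the requirement stated above.

```python
import sys

def meets_minimum(version_info, minimum=(3, 11)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Report the version of the interpreter that is running this code
print("Running python", ".".join(str(part) for part in sys.version_info[:3]))
print("New enough for tianshou:", meets_minimum(sys.version_info))
```

If the last line prints `False`, install a newer python before creating the virtual environment.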

**If you don't have python, or an older version**


On Ubuntu you can install a newer version of python via:

```
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.11
```

On Windows you can download and install python 3.11 or greater from: https://www.python.org/downloads/



**Create a virtual environment**

Create the virtual environment by running the following:
```
python3.11 -m venv my_venv
```
Or, if your default python is already version 3.11 or greater, create the virtual environment by running the following:

```
python -m venv my_venv
```

The "my_venv" is the name of your environment, feel free to change the name.

You now have a virtual environment!

### 2. Install python libraries

One of the main benefits of a virtual environment is that you can install all the libraries you need within it, without messing up any other python projects.

The first step is to activate your virtual environment. This ensures that any libraries you install will be added to that python environment and not some other environment.


**Activate your virtual environment**

Activate your virtual environment by typing the following into your terminal while inside your project folder (linux/mac):

```
source my_venv/bin/activate
```

or with windows: 
```
my_venv\Scripts\activate
```

You should now see (my_venv) at the start of your terminal line. Any libraries we now install will be inside this virtual environment!
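
If you'd like to double check from inside python, here is a minimal sketch. It assumes the environment was created with `python -m venv`, and the function name `in_virtual_environment` is just an illustrative choice.

```python
import sys

def in_virtual_environment():
    """Return True when python is running from inside a venv."""
    # Inside a venv, sys.prefix points at the venv folder,
    # while sys.base_prefix still points at the original installation.
    return sys.prefix != sys.base_prefix

print("Inside a virtual environment:", in_virtual_environment())
print("Python executable:", sys.executable)
```

When the environment is active, the executable path should sit inside your my_venv folder.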

**Install tianshou**

Let's install the libraries we'll need. Thankfully it's quite straightforward:

```pip install tianshou```

It should take a little time to install tianshou and all the libraries it depends on (e.g., torch and numpy).

You might also need the library pygame, so let's install that too:

```pip install pygame```


### 3. Check that it works!

Here we'll do a quick check that the installation worked. 


In [None]:
import tianshou
print(tianshou.__version__)
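
If the import above fails, it can help to see which libraries python cannot find. The sketch below uses only the standard library. The helper name `missing_packages` and the list of names are assumptions based on the libraries used in this tutorial.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Libraries used in this tutorial
needed = ["tianshou", "gymnasium", "torch", "tensorboard", "pygame"]
missing = missing_packages(needed)
print("Missing packages:", missing if missing else "none")
```

Anything listed as missing can be installed with pip while your virtual environment is active.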

### 4. Train our first RL agent

For this first example we'll focus just on the pieces that need to be in place for us to train an RL agent. Later, as we move through the tutorials, we'll learn more about each of the pieces and even start to customize some of them!

But for now let's use existing RL algorithms and some existing environments to just see how it all works together.

To start off, let's create a new python file and call it first_RL.py

Import some libraries:

* **gymnasium** will have some environments for us to use (https://gymnasium.farama.org/)
* **torch** will let us build some neural networks (https://pytorch.org/)
* **Tensorboard** will let us see how well our agent is doing


In [2]:
import gymnasium as gym
import torch
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts

Let's then start a "logger" so we can see what is going on. The code below will write out some summary statistics of our agent to the directory log/dqn

In [3]:
#start a logger
logger = ts.utils.TensorboardLogger(SummaryWriter('log/dqn'))

### Setup an environment

Now let's choose an environment to train our agent in. We'll see how to work with environments and even how to make our own later on.

In [4]:
# Create an environment: render mode = human means we'd like to see the environment.
env = gym.make("CartPole-v1", render_mode="human")

Let's take a look at this environment. To do this let's reset the environment, then see what the agent can "see" and what actions the agent can take.

In [5]:
#start the environment at the "start"
env.reset()

#take a look at the environment
env.render()

You should now see a simple game pop up, with a brown pole being balanced by a black block. The agent moves the black block left or right in an attempt to keep the pole upright (i.e., less than 12 degrees from vertical).

Let's take a look at what the agent sees:

In [None]:
#what can the agent see?
env.observation_space

We can see that it "sees" four values:
> Cart Position in the x-axis

> Cart Velocity in the x-axis

> Pole Angle from vertical

> Pole Angular velocity
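
To make the four values easier to read, here is a small sketch that attaches a name to each one and checks whether the pole is still within 12 degrees of vertical. It assumes the pole angle is reported in radians, as described in the Gymnasium documentation. The helper names and the example numbers are made up for illustration.

```python
import math

# Names for the four observation values, in the order CartPole reports them
OBSERVATION_NAMES = ("cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity")

def label_observation(observation):
    """Pair each observation value with its name."""
    return dict(zip(OBSERVATION_NAMES, observation))

def pole_is_upright(observation, max_degrees=12.0):
    """Return True while the pole angle (in radians) is within the limit."""
    return abs(observation[2]) < math.radians(max_degrees)

example = [0.01, -0.02, 0.03, 0.04]  # made-up values, not real output
print(label_observation(example))
print("Upright:", pole_is_upright(example))
```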


What actions can the agent take in this environment?

In [None]:
#what can the agent do?
env.action_space

It can take two discrete actions:

> Push the cart to the left

> Push the cart to the right

Let's try out some actions ourselves! Let's take a step to the left.

> A left action is coded as 0, and a right is 1

In [None]:
env.step(0)

You should see an output of the new observations after taking the step.

Try running the code block above a few times, and change the actions. See if you can balance the pole!

You can even try taking a few steps at a time using a loop below:

In [9]:
for i in range(10):
    env.step(1)

Try taking more actions, can you stabilize the pole? :-)
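
Each call to `env.step` returns five things: the new observation, the reward, whether the episode terminated, whether it was cut short, and some extra info. As a rough sketch, the hypothetical helper `run_actions` below applies a list of actions, adds up the rewards, and resets the environment if the pole falls. It works with any environment that follows the Gymnasium API.

```python
def run_actions(env, actions):
    """Apply actions one by one and stop early if the episode ends.

    `env` is any object following the Gymnasium API, where step() returns
    (observation, reward, terminated, truncated, info).
    """
    total_reward = 0.0
    steps_taken = 0
    for action in actions:
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps_taken += 1
        if terminated or truncated:
            # The pole fell (or the time limit was hit): start a new episode
            env.reset()
            break
    return steps_taken, total_reward

# Example use with the environment from above (0 = push left, 1 = push right):
# steps, reward = run_actions(env, [1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
```

The number of steps you manage before the episode ends is a simple score to compare against the trained agent later.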

Ok, let's train an agent to do it!

In [10]:
#let's close this instance of the environment for now... we'll open up one again below
#but we'll leave out the render_mode="human" so that it runs faster (i.e., doesn't have to create an image for us to look at)
env.close()

### Setup an agent

Let's start building our agent. 

To start off, let's build a neural network that takes what the agent observes and converts that into actions.

In [12]:
#import a network that we can use
from tianshou.utils.net.common import Net
from tianshou.utils.space_info import SpaceInfo

#get an instance of the environment
env = gym.make('CartPole-v1')

#get all the info about it
space_info = SpaceInfo.from_env(env)

#What the agent 'sees'
state_shape = space_info.observation_info.obs_shape

#what actions the agent can take
action_shape = space_info.action_info.action_shape

#build a network that takes observations and converts it to actions
net = Net(state_shape=state_shape, action_shape=action_shape, hidden_sizes=[128, 128, 128])


The hidden_sizes argument above sets the number of nodes in each hidden layer of the neural network. These are the layers that link the initial input (i.e., observations of the current state: 4) to the possible actions (i.e., 2 discrete actions).
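
To get a feel for the size of this network, we can count its adjustable weights by hand. This is only a sketch: it assumes plain fully connected layers with one bias per node and nothing else (e.g., no normalization layers), so the real network may differ. The function name `count_mlp_parameters` is made up for this example.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weights connecting the two layers
        total += n_out         # one bias per output node
    return total

# 4 observations -> three hidden layers of 128 nodes -> 2 actions
layer_sizes = [4, 128, 128, 128, 2]
print(count_mlp_parameters(layer_sizes))  # 33922
```

Try changing the hidden layer sizes to see how quickly the number of weights grows.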

Now we'll need to build an optimizer to allow our agent to learn! This optimizer will adjust the weights in the neural network to link observations to actions that lead to more rewards.

In the case of the cart pole environment, the agent earns a reward for every step where the pole is held upright (i.e., less than 12 degrees from vertical).

We'll use a pre-built optimizer called Adam, with a learning rate of 0.001. The learning rate determines how quickly the agent adapts its weights during each step. This is a hyperparameter that we will come back to later in the book.
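
To see what the learning rate does, here is a toy sketch. Note that it uses plain gradient descent on a single weight rather than Adam, and the function it minimizes, (w - 3)², is made up purely for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=0.0, target=3.0):
    """Minimize (w - target)**2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - target)       # derivative of (w - target)**2
        w = w - learning_rate * gradient  # step against the gradient
    return w

for lr in (0.001, 0.01, 0.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.3f}")
```

A larger learning rate gets closer to the best weight (3.0) in the same number of steps, but in real networks a learning rate that is too large can make learning unstable.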

In [13]:
#this will shift the network to better predict actions/values
optim = torch.optim.Adam(net.parameters(), lr=0.001)

Now that we have a network and an optimizer let's define a policy that will control how learning takes place.

> The discount factor is how much the agent takes into account future rewards vs. immediate rewards. A value near 1 (such as 0.9) means the agent gives a lot of weight to future rewards, while a value near 0 (such as 0.1) means the agent mostly cares about immediate rewards.

> estimation_step is how many steps into the future the agent should look when calculating the value of different actions.

> target_update_freq is how many network updates should take place before the target network (a second, slowly changing copy of the network used to calculate learning targets) is refreshed to match what the agent has learned.

Some of these parameters are specific to the RL algorithm we are using here (i.e., estimation_step, and target_update_freq).
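
To make the discount factor and estimation_step more concrete, here is a sketch of the standard n-step target: the discounted sum of the next few rewards, plus the discounted estimate of what comes after them. The function name `n_step_target` and the example numbers are assumptions, and tianshou's own calculation handles more details (such as episodes that end early).

```python
def n_step_target(rewards, bootstrap_value, discount_factor=0.9):
    """Discounted sum of n rewards plus the discounted value estimate after them."""
    target = 0.0
    for step, reward in enumerate(rewards):
        target += (discount_factor ** step) * reward
    target += (discount_factor ** len(rewards)) * bootstrap_value
    return target

# Three steps of reward 1 (as in CartPole), then an estimated value of 10
print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0))
```

With a discount factor of 0.9 this works out to 1 + 0.9 + 0.81 + 0.729 × 10, so rewards further in the future count for less.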

In [14]:
policy = ts.policy.DQNPolicy(
    model=net,
    optim=optim,
    discount_factor=0.9,
    action_space=env.action_space,
    estimation_step=3,
    target_update_freq=320
)
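
As a simplified picture of what target_update_freq=320 means, the sketch below lists the update counts at which the target network would be refreshed. This is an illustration only: the exact counting inside tianshou may differ (for example, it may also refresh at the very first update), and the function name is made up.

```python
def target_sync_updates(num_updates, target_update_freq=320):
    """Return the update counts at which the target network would be refreshed."""
    return [update for update in range(1, num_updates + 1)
            if update % target_update_freq == 0]

print(target_sync_updates(1000))  # [320, 640, 960]
```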

Now let's set up a collector to feed observations to the policy as the agent interacts with its environment.

> We'll add a test collector that will run tests periodically to see how well our agent is performing.

In [None]:
train_collector = ts.data.Collector(policy, env, ts.data.VectorReplayBuffer(20000, 1), exploration_noise=True)
test_collector = ts.data.Collector(policy, env, exploration_noise=True)  # because DQN uses epsilon-greedy method (chooses best action, with some noise epsilon)
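
The replay buffer used by the training collector is, at its heart, a fixed-size list of past experiences that we can sample from at random. Here is a minimal sketch of that idea using only the standard library. It is not how tianshou's VectorReplayBuffer is implemented, and the class name `SimpleReplayBuffer` is made up.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """A fixed-size store of experiences; the oldest are dropped when full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, observation, action, reward, next_observation, done):
        self.storage.append((observation, action, reward, next_observation, done))

    def sample(self, batch_size, rng=random):
        """Pick a random batch of stored experiences (without repeats)."""
        return rng.sample(list(self.storage), min(batch_size, len(self.storage)))

    def __len__(self):
        return len(self.storage)

buffer = SimpleReplayBuffer(capacity=20000)
for step in range(100):
    buffer.add(observation=step, action=0, reward=1.0, next_observation=step + 1, done=False)
batch = buffer.sample(batch_size=64)
print(len(buffer), len(batch))  # 100 64
```

Sampling at random mixes old and new experiences, so the network is not trained only on what happened most recently.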

Now that we have:

1. An environment
2. A Policy with a network model and an optimizer
3. A collector to store the agent experiences

We can now train our agent!

We'll use something called an off-policy trainer for now. Off-policy means the agent can learn from experiences stored in the replay buffer, even when they were collected by an earlier version of its policy. The DQN algorithm we use here also keeps a second copy of the neural network (the target network) that is only updated periodically, which helps with the stability of the training/learning. However, we'll see in a few tutorials how we can have an on-policy trainer, where the agent learns only from experiences collected by its current policy.

Parameters:
> max_epoch is the maximum number of rounds of training to run before stopping the training.

> step_per_epoch is the number of actions the agent will take per epoch.

> step_per_collect is the number of actions the agent takes, and stores in the replay buffer (a list of stored experiences), before each round of network updates.

> episode_per_test is the number of episodes to run during testing that occurs at the end of an epoch. This estimates how much our agent has learned.

> batch_size is the number of experiences to take from the replay buffer when training the neural network model.

> update_per_step is the number of network updates per action collected. Here 1 / 10 with step_per_collect=10 means one update after every 10 actions.

> train_fn is a function that is called at the start of each training epoch. Here it sets the eps (epsilon) parameter to 0.1, telling the agent to try an exploratory (random) action 10% of the time, rather than what the agent currently thinks is the best action.

> test_fn is the same as train_fn, but is called before each round of testing. Here it lowers eps to 0.05.

> stop_fn is a function that will stop the training if its conditions are met. Here it stops when the average test reward reaches the environment's reward threshold.
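
The eps parameter controls what is known as epsilon-greedy action selection. Here is a minimal sketch for a single choice. The function name and the action values are made up, and tianshou's own version works on whole batches of observations.

```python
import random

def epsilon_greedy(action_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the best-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])

rng = random.Random(0)
values = [0.2, 0.8]  # made-up values for "left" and "right"
choices = [epsilon_greedy(values, eps=0.1, rng=rng) for _ in range(1000)]
print("Share of 'right' choices:", choices.count(1) / len(choices))
```

With eps=0.1 the agent picks "right" most of the time, but still tries "left" now and then, which is how it discovers actions that might turn out to be better.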


In [None]:
result = ts.trainer.OffpolicyTrainer(
    policy=policy,
    train_collector=train_collector,
    test_collector=test_collector,
    max_epoch=30,
    step_per_epoch=10000,
    step_per_collect=10,
    episode_per_test=100,
    batch_size=64,
    update_per_step=1 / 10,
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),
    stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold,
    logger=logger,
).run()
print(f"Finished training in {result.timing.total_time} seconds")
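
To get a sense of how much work the settings above ask for, we can do a little arithmetic. This is a sketch under one assumption: that the number of updates after each collect is update_per_step × step_per_collect, rounded to a whole number.

```python
max_epoch = 30
step_per_epoch = 10000
step_per_collect = 10
update_per_step = 1 / 10
batch_size = 64

# Upper bound on environment steps (training stops earlier if stop_fn is met)
max_env_steps = max_epoch * step_per_epoch
# How many times per epoch the collector gathers a fresh set of steps
collects_per_epoch = step_per_epoch // step_per_collect
# Network updates performed after each collect
updates_per_collect = round(update_per_step * step_per_collect)
# Experiences drawn from the replay buffer in one epoch
samples_per_epoch = collects_per_epoch * updates_per_collect * batch_size

print(max_env_steps, collects_per_epoch, updates_per_collect, samples_per_epoch)
```

So the agent takes at most 300,000 actions in total, and each epoch involves 1,000 network updates of 64 experiences each.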

The code above can take a while to run! To see the training in progress you can launch tensorboard.

If you are using VSCode you can open the command palette and write:

```launch tensorboard```

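Once the code has finished you can save the trained agent to be used later! The models folder must already exist in your project folder, otherwise saving will fail with an error.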

In [23]:
torch.save(policy.state_dict(), 'models/dqn.pth')
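
Saving fails if the folder you are saving into does not exist yet. One way to avoid that is to create the folder from python first. The helper name `prepare_save_path` below is just an illustrative choice.

```python
from pathlib import Path

def prepare_save_path(path):
    """Create the folder for `path` if needed and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Example use before saving:
# save_path = prepare_save_path("models/dqn.pth")
# torch.save(policy.state_dict(), save_path)
```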


You can then load the agent.

In [None]:
policy.load_state_dict(torch.load('models/dqn.pth'))

Let's see how this works by watching the trained agent in a new environment.


In [None]:
#build a new environment, and make it start at the beginning
env = gym.make("CartPole-v1", render_mode="human")
env.reset()

#tell the policy that we are in evaluation mode
policy.eval()

#give some noise to the policy choices of actions
policy.set_eps(0.05)

#Create a collector
collector = ts.data.Collector(policy, env, exploration_noise=True)

#Use the collector to run the agent and environment at 35 frames per second
collector.collect(n_episode=1, render=1 / 35, reset_before_collect=True)

How does your trained agent do? Was it better than you at keeping the pole upright?

In [20]:
#close your environment
env.close()

**Things to try**

> Changing the environment to another classic control environment

>> Go to https://gymnasium.farama.org/environments/classic_control/

>> Choose another environment (it has to have discrete actions with the RL algorithm we are using here!): mountain car or acrobot. We'll learn in more depth other algorithms that can do both discrete and continuous actions.

>> Use the code above as a guide and attempt to run an agent on one of these new environments below!
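
To help pick an environment, the sketch below lists the classic control environments and whether their actions are discrete or continuous. The names and version numbers are taken from the Gymnasium documentation as far as we know, so double check them on the page linked above before using them in `gym.make`.

```python
# Action types of the classic control environments in Gymnasium
CLASSIC_CONTROL_ACTIONS = {
    "CartPole-v1": "discrete",
    "MountainCar-v0": "discrete",
    "Acrobot-v1": "discrete",
    "MountainCarContinuous-v0": "continuous",
    "Pendulum-v1": "continuous",
}

def environments_for_dqn(action_types):
    """Return the environment names with discrete actions, sorted by name."""
    return sorted(name for name, kind in action_types.items() if kind == "discrete")

print(environments_for_dqn(CLASSIC_CONTROL_ACTIONS))
# ['Acrobot-v1', 'CartPole-v1', 'MountainCar-v0']
```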



In [21]:
#try out another environment