# Prereqs
---
Create your [Anaconda environment](https://www.anaconda.com/download/success). For this guide I used the command `conda create -n gymnasium python=3.12.2` followed by `conda activate gymnasium`.
## Conda Cmds
---
```bash
conda install conda-forge::gymnasium
conda install conda-forge::gymnasium-box2d
conda install conda-forge::stable-baselines3
```
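
Before moving on, it can help to confirm that the packages are importable from the active environment. The snippet below is a minimal sketch using only the standard library. The helper name `missing_packages` is my own, and the import names (`gymnasium`, `Box2D`, `stable_baselines3`) are assumed to be what the conda packages above provide.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the packages installed above.
REQUIRED = ["gymnasium", "Box2D", "stable_baselines3"]

missing = missing_packages(REQUIRED)
print("missing:", missing if missing else "none")
```

If anything is listed as missing, rerun the matching `conda install` line with the `gymnasium` environment activated.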

In [None]:
import gymnasium as gym
from stable_baselines3 import DQN
import time

# Running the environment
In this example I'm going to be using [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/). One of the presets for this game is `"LunarLander-v2"`, so I will be using this when creating the environment.

This process isn't training the agent at all. Right now we are just selecting a random action with `action = env.action_space.sample()` and taking that action with `env.step(action)`.

`env.render()` allows us to see the agent playing the game in its own window when the environment is created with `render_mode="human"`.

This condition checks whether the current episode has ended and, if so, resets the environment with `env.reset()`:
```py
if terminated or truncated:
    observation, info = env.reset()
```
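
The random-action loop described above can also be wrapped in a small helper that sums the reward of each finished episode, which gives a baseline to compare a trained agent against. This is only a sketch: `random_rollout` is a name I made up, and it assumes nothing beyond the `reset`, `step`, and `action_space.sample` calls already shown.

```python
def random_rollout(env, n_steps):
    """Take n_steps random actions; return the total reward of each finished episode."""
    episode_returns = []
    current_return = 0.0
    observation, info = env.reset()
    for _ in range(n_steps):
        action = env.action_space.sample()  # random action, no learning involved
        observation, reward, terminated, truncated, info = env.step(action)
        current_return += float(reward)
        if terminated or truncated:
            # Episode ended: record its return and start a fresh one.
            episode_returns.append(current_return)
            current_return = 0.0
            observation, info = env.reset()
    return episode_returns
```

With the environment from this guide it could be called as `random_rollout(env, 500)`. An episode still in progress when the steps run out is not counted.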


In [None]:
env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()

print("running 500 steps of random actions")
for i in range(500):
    print(f"step {i+1}", end="\r")
    env.render()
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()
time.sleep(2)
env.close() # here just because closing the window manually barely works

# Training
Training requires a learning algorithm. DQN (Deep Q-Network) is provided by SB3 (Stable-Baselines3) and is compatible with the `Discrete` action space and the `Box` observation space of Lunar Lander.

First I'm recreating the `env` variable, only because I called `env.close()` in the last block. Then I'm using `model.learn()` to train our model for 20k steps.

Setting `verbose=1` will spit out a lot of logs just to show that it's working. Later on we will have more sophisticated methods for this.

This training takes around 10 minutes on my PC using only the CPU, so feel free to lower the step count if it's taking too long.
## Logging
Here is an example of what the default log will look like. 

`total_timesteps` shows how many steps have been taken so far in this training run.

`ep_rew_mean` shows the agent's average reward per episode. Once we are more familiar with the environment, we will log more info and create graphs to gain insight into training.
```
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 115      |
|    ep_rew_mean      | -270     |
|    exploration_rate | 0.455    |
| time/               |          |
|    episodes         | 10       |
|    fps              | 47       |
|    time_elapsed     | 24       |
|    total_timesteps  | 1147     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.5      |
|    n_updates        | 261      |
----------------------------------
```
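
The `exploration_rate` row is DQN's epsilon, the chance of taking a random action instead of the one the network currently prefers. As far as I know, SB3 anneals it linearly over the start of training. The sketch below reproduces that schedule in plain Python, assuming SB3's default DQN settings (`exploration_initial_eps=1.0`, `exploration_final_eps=0.05`, `exploration_fraction=0.1`) and the 20k `total_timesteps` used in this guide. The function `exploration_rate` is my own helper, not part of SB3.

```python
def exploration_rate(step, total_timesteps, initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linearly anneal epsilon over the first `fraction` of training, then hold it."""
    progress = step / total_timesteps
    if progress >= fraction:
        return final_eps
    return initial_eps + (progress / fraction) * (final_eps - initial_eps)


# Step count taken from the example log above.
print(round(exploration_rate(1147, 20_000), 3))  # prints 0.455
```

Under those assumptions the result matches the `0.455` in the log, and after the first 2,000 steps the agent acts randomly only 5% of the time.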
## Saving
SB3 models have built-in helper functions for saving your model. The function below saves the current agent into a zip file at the path you provide.
```py
model.save("dqn_lunar_lander")
```
These models can then be loaded again (useful for fine-tuning a model) with the function below:
```py
model = DQN.load("dqn_lunar_lander")
```
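
Loading fails if the zip file isn't where you expect it, so a quick check before calling `DQN.load` can save some confusion. The sketch below assumes that `model.save("dqn_lunar_lander")` writes `dqn_lunar_lander.zip` (SB3 appends `.zip` when the path has no extension). `saved_model_path` is a hypothetical helper of mine, not an SB3 function.

```python
from pathlib import Path


def saved_model_path(name):
    """Return the path the saved zip is expected at, adding .zip if no extension was given."""
    path = Path(name)
    return path if path.suffix else path.with_suffix(".zip")


path = saved_model_path("dqn_lunar_lander")
if path.exists():
    print(f"found {path}, safe to call DQN.load")
else:
    print(f"{path} not found, run the training cell first")
```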

In [None]:
print("training 20k steps of DQN")
env = gym.make("LunarLander-v2", render_mode="human")
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20000, log_interval=1)
model.save("dqn_lunar_lander")

# Using A Trained Model
First I load the model trained in the cell above, which lets me interact with the environment based on what the trained agent learned.
To do so we will use `model.predict(obs, deterministic=True)`. We are passing in `obs` (the observation), _which is the information the environment provides to the agent based on its current state_. This function returns an action and states (the states are not needed here).

Now, instead of taking a random action as before, we take the trained model's prediction as our next action.

Depending on how many timesteps you trained your model for, it should find some sort of pattern when landing. Whether it lands successfully depends on how much training was done.

Interacting with the environment in this manner does not train the model as we take steps, since we are not feeding any more data into the model. This allows us to run **evaluations** on the model and statistically test how fit the model is.
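
An evaluation can be as simple as running a few full episodes with the trained model and averaging the total reward. Below is a minimal hand-rolled sketch: `evaluate` is my own helper and only assumes the `model.predict`, `env.reset`, and `env.step` calls already used in this guide. SB3 also ships its own `evaluate_policy` in `stable_baselines3.common.evaluation`, which is likely the better choice in practice.

```python
from statistics import mean, pstdev


def evaluate(model, env, n_episodes=10):
    """Run n_episodes with deterministic actions; return (mean, std) of episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        done = False
        total = 0.0
        while not done:
            # No learning happens here: we only ask the model for its best action.
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return mean(returns), pstdev(returns)
```

With the model and environment from this guide, `mean_return, std_return = evaluate(model, env)` gives a rough score that can be compared across training runs.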

In [None]:
model = DQN.load("dqn_lunar_lander")
print("loaded trained model")

env = gym.make("LunarLander-v2", render_mode="human") # recreating env because loading the model and the env are separate
obs, info = env.reset()
print("running 500 steps of model predictions")
for i in range(500):
    print(f"step {i+1}", end="\r")
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
time.sleep(2)
env.close()