## Our Environment
[Is there some way to visualize our env in a good way? Overlay all the Swaptions and Qs as semi transparent red and blue then add CVA as solid?]

Our environment resembles the Mountain Car environment in several key ways: it is continuous in both input and output, it is an optimal control problem, and it is Markov. There are a few key differences, however. First, we have an 18-dimensional action space and a 37-dimensional observation space, so the problem is much larger. Second, our environment has a fixed termination date after 9 (10) years, which we should keep in mind when we think about discounting and planning.
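
Because the episode has a fixed end date, it helps to check how far ahead a given discount factor lets the agent look. The sketch below uses the common rule of thumb that a discount factor `gamma` corresponds to an effective horizon of roughly `1 / (1 - gamma)` steps. The function names are hypothetical, and the example values are assumptions for illustration: `gamma = 0.99`, and an episode length of 2510 steps taken from the `251*10` factor in the training call.

```python
def effective_horizon(gamma: float) -> float:
    """Approximate number of steps over which discounted rewards still matter."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def gamma_for_horizon(n_steps: int) -> float:
    """Discount factor whose effective horizon is about n_steps."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return 1.0 - 1.0 / n_steps


# Illustrative values only (assumptions, not measured from the environment).
print(effective_horizon(0.99))
print(gamma_for_horizon(2510))
```

With `gamma = 0.99` the effective horizon is about 100 steps, far shorter than a multi-year episode, so a value closer to 1 may be worth trying.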

#### Action/Obs Space 
##### Observation
The observation is an `ndarray` with shape `(37,)` where the elements correspond to the following:
|Num  |Observation                                                       |Min     |Max    |Unit                 |
|-----|------------------------------------------------------------------|--------|-------|---------------------|
|0-8  |Fraction of portfolio value in Swaptions expiring in year 1 to 9  | 0      | 1     | float - fraction    |
|9-17 |Fraction of portfolio in defaulting after year 1 to 9             | 0      | 1     | float - fraction    |
|18-26|Swaptions expiring in year 1 to 9 with strike at Swap strike      | 0      | Inf*  | float - $ value     |
|27-35|Probability of defaulting after year 1 to 9                       | 0      | 1     | float - probability |
|36   |Current interest rate                                             | -Inf** | Inf** | float - %           |

\* In practice very small \
\*\* In practice roughly between -3 and 10
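
To make the index ranges in the table easier to work with, the flat observation can be split into named blocks. This is a minimal sketch: the block names and the `split_observation` helper are hypothetical and not part of the environment. It only uses slicing, so it should work the same way on a 37-element `ndarray` as on the plain list used here.

```python
from typing import Dict, Sequence

# Index ranges taken from the observation table (end index is exclusive).
# The names are illustrative labels, not identifiers from the environment.
OBS_LAYOUT = {
    "swaption_fraction": slice(0, 9),
    "default_fraction": slice(9, 18),
    "swaption_value": slice(18, 27),
    "default_probability": slice(27, 36),
    "interest_rate": slice(36, 37),
}


def split_observation(obs: Sequence[float]) -> Dict[str, Sequence[float]]:
    """Split a flat 37-element observation into named blocks."""
    if len(obs) != 37:
        raise ValueError(f"expected 37 elements, got {len(obs)}")
    return {name: obs[block] for name, block in OBS_LAYOUT.items()}


# Dummy observation 0..36, used only to show which indices land where.
parts = split_observation(list(range(37)))
print(parts["interest_rate"])
```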

##### Action
The action is an `ndarray` with shape `(18,)` where the elements correspond to the following:
|Num  |Action                                                            |Min   |Max    |Unit              |
|-----|------------------------------------------------------------------|------|-------|------------------|
|0-8  |Fraction of portfolio value in Swaptions expiring in year 1 to 9  | 0    | 1     | float - fraction | 
|9-17 |Fraction of portfolio in defaulting after year 1 to 9             | 0    | 1     | float - fraction |


#### Dynamics
The dynamics follow our random market simulation: most prices move continuously and in a locally bounded manner. Some prices drop to 0 after expiry, but if the agent attempts to buy them, the environment takes care of it. There's more to say here.
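
How the environment actually handles purchases of expired assets is not spelled out here. One possible approach, shown purely as an assumption, is to zero out the action weights on expired assets before applying the action. The helper `zero_expired` and its `expired` flags are hypothetical.

```python
from typing import List, Sequence


def zero_expired(weights: Sequence[float], expired: Sequence[bool]) -> List[float]:
    """Return a copy of the action with weights on expired assets set to 0."""
    if len(weights) != len(expired):
        raise ValueError("weights and expired must have the same length")
    return [0.0 if is_expired else float(w) for w, is_expired in zip(weights, expired)]


# Toy example with three assets, the first of which has expired.
print(zero_expired([0.5, 0.25, 0.25], [True, False, False]))
```

Whether the remaining weights should then be renormalized is a separate design choice that this sketch leaves open.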

#### Goal and Rewards
The goal of the model is to minimize a loss metric related to risk. I chose to approximate this by minimizing the sum of squared stepwise P&L (Profit and Loss) as a stand-in for variance. As in Mountain Car, we could add a penalty term for buying expired assets, which might needlessly complicate training, or might reduce variance and improve it...
$$r_i = -(\mathrm{CVA}_i - \mathcal{P}_{\mathrm{hedge}})^2$$
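
The reward formula can be written out as a small function. This is a sketch under assumptions: the argument names `cva_i` and `p_hedge` mirror the symbols in the formula, the helper names are hypothetical, and the episode total is taken as a plain undiscounted sum.

```python
from typing import Sequence


def step_reward(cva_i: float, p_hedge: float) -> float:
    """Negative squared difference between the CVA term and the hedge term."""
    return -((cva_i - p_hedge) ** 2)


def episode_return(cva: Sequence[float], hedge: Sequence[float]) -> float:
    """Undiscounted sum of stepwise rewards over an episode."""
    return sum(step_reward(c, h) for c, h in zip(cva, hedge))


# Toy numbers: a perfect hedge earns 0, any mismatch is penalised quadratically.
print(step_reward(1.0, 1.0))
print(step_reward(1.0, 0.5))
```

Because the penalty is quadratic, one large mismatch costs more than several small ones of the same total size, which is what makes the sum behave like a variance proxy.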

#### Initial State
The initial state of the market is an ATM swaption and derivatives to coincide with it, and there is some constant initial default risk. The initial hedge is an even spread in value across all of the assets, but this should probably be changed to a delta hedge once we have that.

#### Episode End
The episode ends after year 9 when the naive CVA is zero.

## Code

### Imports and Environment

In [4]:
import collections
import os
import pickle
import sys
from typing import Any, List, Sequence, Tuple

import gymnasium as gym
import numpy as np
import tqdm
from matplotlib import pyplot as plt
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

# Project-local modules
import path_datatype
from env import tradingEng


# Define environment
with open("1.6kRunDemo.pkl","rb") as fp:
    paths = pickle.load(fp)
env = tradingEng(paths)


### Making the Agent

In [None]:
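n_actions = 18  # size of the action space
# Temporally correlated (Ornstein-Uhlenbeck) exploration noise, one component per action
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.05 * np.ones(n_actions), theta=0.01)
model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1, batch_size=25)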

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


### Training the Agent

In [6]:
Nruns = 10
model.learn(total_timesteps=251*10*Nruns, log_interval=2)
model.save("ddpg_fin")

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1.18e+03 |
|    ep_rew_mean     | nan      |
| time/              |          |
|    episodes        | 2        |
|    fps             | 70       |
|    time_elapsed    | 33       |
|    total_timesteps | 2359     |
| train/             |          |
|    actor_loss      | nan      |
|    critic_loss     | nan      |
|    learning_rate   | 0.001    |
|    n_updates       | 2258     |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1.18e+03 |
|    ep_rew_mean     | nan      |
| time/              |          |
|    episodes        | 4        |
|    fps             | 73       |
|    time_elapsed    | 64       |
|    total_timesteps | 4702     |
| train/             |          |
|    actor_loss      | nan      |
|    critic_loss     | nan      |
|    learning_rate   | 0.001    |
|    n_updates       | 4601     |
--------------
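
The `nan` values for `ep_rew_mean`, `actor_loss`, and `critic_loss` in the log above indicate that a non-finite number reached the training loop at some point. A possible first step, sketched here with a hypothetical helper, is to check observations and rewards for NaN or infinite entries as they come out of the environment. Stable-Baselines3 also provides a `VecCheckNan` wrapper that may be worth trying for the same purpose.

```python
import math
from typing import List, Sequence


def non_finite_indices(values: Sequence[float]) -> List[int]:
    """Return the positions of NaN or infinite entries in a flat sequence."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


# Toy observation with one NaN and one infinity planted in it.
sample = [0.1, float("nan"), 0.3, float("inf")]
print(non_finite_indices(sample))
```

Knowing which index goes bad first narrows the search to one row of the observation table.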

### Run a test

In [None]:

env = tradingEng(paths)  # fresh environment for evaluation

episode_over = False
rewards = []
actions = []
obs, info = env.reset()
while not episode_over:
    # Query the trained policy without exploration noise
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    rewards.append(reward)
    actions.append(action)
    episode_over = terminated or truncated
env.close()

print(f'Example action taken: {actions[0]}')
print(f'Episode rewards: {rewards}')

ValueError: You have passed a tuple to the predict() function instead of a Numpy array or a Dict. You are probably mixing Gym API with SB3 VecEnv API: `obs, info = env.reset()` (Gym) vs `obs = vec_env.reset()` (SB3 VecEnv). See related issue https://github.com/DLR-RM/stable-baselines3/issues/1694 and documentation for more information: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecenv-api-vs-gym-api
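
The error message distinguishes two conventions: a Gym-style `reset()` returns `(obs, info)`, while an SB3 VecEnv `reset()` returns only `obs`. The hypothetical helper below is a minimal sketch that accepts either form, so that only the observation is ever passed on to `predict()`. It assumes the `info` element is a `dict`, as in the Gym API.

```python
from typing import Any, Dict, Tuple


def unpack_reset(result: Any) -> Tuple[Any, Dict[str, Any]]:
    """Normalise a reset() result to an (obs, info) pair."""
    # Gym-style: a 2-tuple whose second element is the info dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0], result[1]
    # VecEnv-style: the observation alone, so supply an empty info dict.
    return result, {}


# Both call styles yield the same observation.
print(unpack_reset(([1.0, 2.0], {"t": 0})))
print(unpack_reset([1.0, 2.0]))
```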