# Tutorial

In this tutorial we will explore some of the gymCICY environments and construct a simple A3C agent using the ChainerRL library.

We begin by importing some relevant libraries.

In [None]:
import numpy as np
import gym
import gymCICY
import os
from pyCICY import CICY

## The physical setting

Let's pick a well-studied example from the [literature](https://arxiv.org/pdf/1106.4804.pdf). Take the CICY with index 6777 from the [cicylist](http://www-thphys.physics.ox.ac.uk/projects/CalabiYau/cicylist/cicylist.txt). It has the following configuration matrix:

$$
	\mathcal{M}_{6777} =  \left[
	\begin{array}{c||cccc}
	1 & 1 & 1 & 0 & 0 \\
	1 & 0 & 0 & 0 & 2 \\
	1 & 0 & 0 & 2 & 0 \\
	1 & 2 & 0 & 0 & 0 \\
	3 & 1 & 1 & 1 & 1
	\end{array}
	\right]^{5,37}_{-64}.
$$

We now define a CICY object using the pyCICY library.

In [None]:
conf = np.array([[1,1,1,0,0], [1,0,0,0,2], [1,0,0,2,0], [1,2,0,0,0], [3,1,1,1,1]])
M = CICY(conf)
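
As a quick sanity check on the configuration matrix, here is a minimal sketch in plain Python. It does not use pyCICY, and the helper names are hypothetical. The first column holds the dimensions $n_i$ of the ambient projective spaces, and the Calabi-Yau condition requires the degrees in each row to sum to $n_i + 1$.

```python
# Configuration matrix of CICY 6777: first column = dimension n_i of each
# projective space, remaining columns = multi-degrees of the defining polynomials.
conf = [[1, 1, 1, 0, 0],
        [1, 0, 0, 0, 2],
        [1, 0, 0, 2, 0],
        [1, 2, 0, 0, 0],
        [3, 1, 1, 1, 1]]

def satisfies_cy_condition(conf):
    """True if the degrees in every row sum to n_i + 1."""
    return all(sum(row[1:]) == row[0] + 1 for row in conf)

def complex_dimension(conf):
    """Total ambient dimension minus the number of defining polynomials."""
    n_polynomials = len(conf[0]) - 1
    return sum(row[0] for row in conf) - n_polynomials

def euler_number(h11, h21):
    """Euler number of a Calabi-Yau threefold from its Hodge numbers."""
    return 2 * (h11 - h21)

print(satisfies_cy_condition(conf))  # True
print(complex_dimension(conf))       # 3
print(euler_number(5, 37))           # -64
```

The last value reproduces the subscript $-64$ of the configuration matrix from the Hodge numbers $(5, 37)$ in its superscript.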

It is known that the following vector bundle
$$
V = (1,0,0,-1,0) \oplus (1,-1,-2,0,1) \oplus (0,1,1,1,-1) \oplus (0,-1,1,0,0) \oplus (-2,1,0,0,0)
$$
leads to a string compactification with 'realistic' gauge group and particle content when using a $\mathbb{Z}_2$ Wilson line. Next we will explore the f4p1 environment and try to recover this result.
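
One property of $V$ that is easy to verify by hand is $c_1(V) = 0$: for a sum of line bundles, the first Chern class is the component-wise sum of the charges. The following sketch (plain Python, with a hypothetical helper name) checks this for the bundle above.

```python
# The line bundle sum V from the text: one tuple of charges per line bundle.
V = [(1, 0, 0, -1, 0),
     (1, -1, -2, 0, 1),
     (0, 1, 1, 1, -1),
     (0, -1, 1, 0, 0),
     (-2, 1, 0, 0, 0)]

def first_chern_class(bundle):
    """c_1 of a sum of line bundles: component-wise sum of the charges."""
    return [sum(charges) for charges in zip(*bundle)]

print(first_chern_class(V))  # [0, 0, 0, 0, 0]
```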

## gymCICY

As explained in [Branes with Brains](https://arxiv.org/abs/1903.11616), the agents, in this case us, interact with the gym environment by performing an action via

```python
obs, r, done, info = env.step(action)
```

The action leads to 

1. an observation, *obs*, which in our case is the sum of line bundles V. It is needed for the agent to determine its next action.
2. a reward, *r*, judging the performed action. The reward is determined by how many model building constraints have been satisfied.
3. a boolean, *done*. It is True if a model has been found and all constraints are satisfied, otherwise False. If True one has to reset the environment. 
4. and possibly some additional information, *info*. Here, we slightly break with usual gym notation and actually return information about the found model. If no model has been found this is an empty dict.
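
The interaction pattern described above can be sketched without gymCICY. `ToyEnv` below is a hypothetical stand-in, not part of gym or gymCICY; it only mimics the four return values of `step` and the need to call `reset` once `done` is True.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a gym environment (NOT gymCICY)."""

    def __init__(self, target=3):
        self.target = target
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 increases the state by one, any other action decreases it
        self.state += 1 if action == 0 else -1
        done = self.state == self.target
        reward = 1.0 if done else -0.1
        # mimic the convention above: info is empty unless a "model" was found
        info = {'state': self.state} if done else {}
        return self.state, reward, done, info

def run(env, n_steps, seed=0):
    """Take random actions and reset the environment whenever done is True."""
    rng = random.Random(seed)
    obs = env.reset()
    found = []
    for _ in range(n_steps):
        obs, r, done, info = env.step(rng.choice([0, 1]))
        if done:
            found.append(info)
            obs = env.reset()
    return found

found = run(ToyEnv(), n_steps=1000)
print('number of episodes that ended with done=True:', len(found))
```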

We will next define the f4p1 gym environment and explore the setting in more detail.

We begin by registering the environment. It requires three parameters from us.

1. A CICY object $M$
2. the order $|\Gamma|$ of the freely acting symmetry (called `rank` in the code)
3. $q_{max}$, the maximal allowed charge for the first four line bundles.

From before we have $q_{max}:=2$, $|\Gamma| := 2$ and $M := \mathcal{M}_{6777}$.

In [None]:
rank = 2
qmax = 2
gym.envs.register(
        id='CICY-v1',
        entry_point='gymCICY.envs.f4p1:f4p1',
        kwargs={'M': M, 'r': rank, 'max': qmax},
        )

After the registration we are in a position to create an OpenAI environment object.

In [None]:
env = gym.make('CICY-v1')

The first step when creating a new environment is to set a seed and reset the observation space.

In [None]:
env.seed(2020)
new_obs = env.reset()
new_obs

We see that the current vector bundle is
$$
V_{0} = (-1,-1,1,0,-1) \oplus (0,-1,-1,-1,1) \oplus (0,1,1,1,0) \oplus (-1,-1,-1,-1,1) \oplus (2,2,0,1,-1)
$$
The first four line bundles have charges $q_i^j \in \{-1,0,1\}$ and the last line bundle fixes $c_1(V) =0$.
We will need to perform quite a few actions in order to match the realistic configuration from above. Starting with the first line bundle, we increase the first charge by one step.

In [None]:
new_obs, r, done, info = env.step(0)
print('The new observation is: \n {}'.format(new_obs))
print('The reward for our action is: {}'.format(r))
print('Did we find a model? {}'.format(done))
print('Do you want to tell us something about this model? \n {}'.format(info))

We see that the first charge has increased by one. This was compensated by a change in the first charge of the fifth line bundle. Note that this is not a very good configuration, as we receive a negative reward of -0.6. We already fail the first reward check, since the third line bundle is semipositive and thus cannot have slope zero anywhere in the Kähler cone.

The next actions we have to do in order to reach our model are the following:

In [None]:
actions = [0,1,22,23,4,5,27,8,34,15,17,17,18,39]

Note that an action value of $3+4\cdot h^{1,1} = 23$ decreases the $(3+1)$-th charge, i.e. the fourth one. Let's iterate over the actions and observe:

In [None]:
for a in actions:
    new_obs, r, done, info = env.step(a)
    print('The new observation is: \n {}'.format(new_obs))
    print('The reward for our action is: {}'.format(r))
    print('Did we find a model? {}'.format(done))
    print('Do you want to tell us something about this model? \n {}'.format(info))

And in fact we made it. Congratulations, a first model has been found! It is interesting how highly nonlinear our space is: changing a single charge changed the reward from -0.4 to 11020102.0.
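
The action encoding used above can be reproduced in a few lines of plain Python. This is a sketch inferred from the action list and the observations in this tutorial, not gymCICY's actual implementation: we assume the first $4 \cdot h^{1,1} = 20$ actions raise one charge of the first four line bundles (counted row by row), the remaining 20 lower it, and the fifth line bundle compensates so that $c_1(V) = 0$. The name `apply_action` is hypothetical.

```python
H11 = 5      # h^{1,1} of CICY 6777, i.e. the number of charges per line bundle
N_FREE = 4   # number of line bundles the agent changes directly

def apply_action(bundle, action):
    """Return a new bundle after one action, keeping c_1(V) = 0."""
    new = [list(lb) for lb in bundle]
    index = action % (N_FREE * H11)             # position in the flattened 4 x 5 block
    sign = 1 if action < N_FREE * H11 else -1   # first 20 actions raise, last 20 lower
    line_bundle, charge = divmod(index, H11)
    new[line_bundle][charge] += sign
    new[-1][charge] -= sign                     # the fifth line bundle compensates
    return new

# Starting configuration V_0 from the text.
V0 = [[-1, -1, 1, 0, -1],
      [0, -1, -1, -1, 1],
      [0, 1, 1, 1, 0],
      [-1, -1, -1, -1, 1],
      [2, 2, 0, 1, -1]]

# The single step taken first, followed by the list of actions from above.
V = V0
for a in [0] + [0, 1, 22, 23, 4, 5, 27, 8, 34, 15, 17, 17, 18, 39]:
    V = apply_action(V, a)

for line_bundle in V:
    print(line_bundle)
# [1, 0, 0, -1, 0]
# [1, -1, -2, 0, 1]
# [0, 1, 1, 1, -1]
# [0, -1, 1, 0, 0]
# [-2, 1, 0, 0, 0]
```

Under these assumptions the walk ends exactly at the realistic bundle $V$ quoted in the physical setting.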

## A3C agent

In this section we will build a simple A3C agent using ChainerRL. [A3C agents](https://arxiv.org/abs/1602.01783) are defined by a policy and a state value function, which are approximated by two neural networks sharing the first layers of weights.

First we import some useful Chainer functions.

In [None]:
import chainer
import chainerrl
import chainer.functions as F
from chainerrl.agents import a3c
from chainerrl import experiments
import chainer.links as L
from chainerrl import misc
from chainerrl.optimizers.nonbias_weight_decay import NonbiasWeightDecay
from chainerrl.optimizers import rmsprop_async
from chainerrl import policy
from chainerrl import v_function

We take a simple NN with two hidden layers, each having $n_{hidden}$ neurons and using ReLU activation functions. We can define such a NN by chaining the respective layers together into a class.

In [None]:
class Body(chainer.Chain):

    def __init__(self, obs_size, n_hidden_channels):
        
        # save input and output length in its own variables
        self.n_output_channels = n_hidden_channels
        self.n_input_channels = obs_size
        
        super().__init__()
        
        with self.init_scope():
            # input layer
            self.l0 = L.Linear(obs_size, n_hidden_channels)
            # first hidden
            self.l1 = L.Linear(n_hidden_channels, n_hidden_channels)
            # second hidden
            self.l2 = L.Linear(n_hidden_channels, n_hidden_channels)

    def __call__(self, obs):
        # we feed the observation through the linear layers
        # and apply the non linearity via a ReLU function. 
        h = F.relu(self.l0(obs))
        h = F.relu(self.l1(h))
        return F.relu(self.l2(h))

Having defined the body we can complete the A3Cagent by introducing a softmax layer for the policy and another linear layer with one outgoing node for the state value function.

In [None]:
class A3Cagent(chainer.ChainList, a3c.A3CModel):

    def __init__(self, n_input, n_actions, n_hidden):

        self.head = Body(n_input, n_hidden)
        self.pi = policy.FCSoftmaxPolicy(self.head.n_output_channels, n_actions)
        self.v = v_function.FCVFunction(self.head.n_output_channels)
        super().__init__(self.head, self.pi, self.v)

    def pi_and_v(self, state):
        out = self.head(state)
        return self.pi(out), self.v(out)
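
To make the role of the softmax policy head concrete, here is a minimal sketch in plain Python. It is independent of ChainerRL, and the helper names are hypothetical. It turns a vector of logits into action probabilities and computes the policy entropy, which is the quantity weighted by the entropy regularization in A3C. We assume 40 actions, which is the value of `2*4*M.len` for a CICY with five projective factors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n_actions = 40                       # assumed: 2 directions * 4 line bundles * 5 charges
probs = softmax([0.0] * n_actions)   # equal logits, as for an agent with no preference

print(probs[0])                      # 0.025
print(entropy(probs), math.log(n_actions))
```

For equal logits the policy is uniform and the entropy takes its maximal value $\ln 40$; as the agent becomes more certain about its actions the entropy decreases.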

We are now in a position to define our A3C model.

In [None]:
# set the parameters
obs_size = 5*M.len
n_actions = 2*4*M.len
n_hidden = 100

# define the model
model = A3Cagent(obs_size, n_actions, n_hidden)

Having set up the model we can feed it into the [A3C chainer class](https://github.com/chainer/chainerrl/blob/master/chainerrl/agents/a3c.py). This one requires a couple more arguments however. 

We feed it the following hyperparameters:

1. $\gamma$ the discount factor for accumulated reward
2. $t_{max}$ the number of time steps after which the weights are updated
3. the optimization algorithm; we use RMSprop with learning rate $lr$, decay rate $\alpha$ and numerical stability parameter $\epsilon$
4. $\beta$ the regularization parameter sitting in front of the entropy loss term
5. $\phi$ a function applied on the observation space.
6. gradient clipping $gc$
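
To illustrate how $\gamma$ and $t_{max}$ enter, the following sketch computes the discounted $n$-step returns that A3C uses as targets after every $t_{max}$ steps, $R_t = r_t + \gamma R_{t+1}$, bootstrapped from the value estimate of the last state. It is plain Python with a hypothetical function name, and the numbers are illustrative values chosen so that the arithmetic is exact; they are not taken from a training run.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns R_t = r_t + gamma * R_{t+1}, computed backwards in time."""
    returns = []
    R = bootstrap_value            # value estimate of the state after the last step
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns

# Illustrative values only: three steps, gamma = 0.5, bootstrap value 2.0.
print(n_step_returns([1.0, 0.0, 0.0], bootstrap_value=2.0, gamma=0.5))
# [1.25, 0.5, 1.0]
```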

In [None]:
beta = 1
gamma = 0.95
tmax = 5
lr = 0.0001
alpha = 0.99
epsilon = 0.00001
opt = rmsprop_async.RMSpropAsync(lr=lr, eps=epsilon, alpha=alpha)
gc = 20
phi = lambda x: x.astype(np.float32, copy=False)

We define the A3C agent

In [None]:
agent = a3c.A3C(model, opt, t_max=tmax, gamma=gamma,
                    beta=beta, phi=phi)

and set up the optimizer.

In [None]:
opt.setup(model)
opt.add_hook(chainer.optimizer.GradientClipping(gc))

In a last step, we let the agent train using the Chainer experiment [*train_agent_async*](https://github.com/chainer/chainerrl/blob/master/chainerrl/experiments/train_agent_async.py).
This one requires

1. a function that creates the gym environments for every worker with different seeds 
2. the number of training steps
3. the number of workers/threads used for training
4. the number of evaluation runs and interval
5. the maximal number of steps per episode
6. a decay hook for the learning rate

and more.

We define the auxiliary functions

In [None]:
def lr_setter(env, agent, value):
    agent.optimizer.lr = value
    
def make_env(process_idx, test):
    process_seed = process_seeds[process_idx]
    env_seed = 2 ** 31 - 1 - process_seed if test else process_seed
    env = gym.make('CICY-v1')
    env.seed(int(env_seed))
    return env
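
The seeding scheme in `make_env` can be checked in isolation. The sketch below repeats the same arithmetic in plain Python; `threads = 4` is an assumed example value, since the real number depends on your machine, and `env_seed` is a hypothetical helper name.

```python
def env_seed(process_seed, test):
    """Same rule as in make_env: evaluation environments get a mirrored seed."""
    return 2 ** 31 - 1 - process_seed if test else process_seed

seed = 42
threads = 4   # assumed example value; the tutorial uses multiprocessing.cpu_count()
process_seeds = [i + seed * threads for i in range(threads)]

print(process_seeds)                               # [168, 169, 170, 171]
print([env_seed(s, True) for s in process_seeds])  # [2147483479, 2147483478, 2147483477, 2147483476]
```

Every worker gets its own seed, and the evaluation environments never share a seed with the training environments.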

and set the remaining hyperparameters.

In [None]:
import multiprocessing

In [None]:
seed = 42
nsteps = 10**6
eval_n_runs = 10
eval_interval = 50000
max_episode_len = 200
threads = multiprocessing.cpu_count()
lr_decay_hook = experiments.LinearInterpolationHook(nsteps, lr, 0, lr_setter)
outdir = 'results'
process_seeds = np.arange(threads) + seed * threads
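
The learning rate decay configured above interpolates linearly from `lr` down to 0 over `nsteps` steps. The following is a minimal sketch of that schedule in plain Python; it is not ChainerRL's implementation, which may treat the endpoints slightly differently, and `linear_lr` is a hypothetical name.

```python
def linear_lr(step, total_steps, start, stop):
    """Linearly interpolate between start and stop, clamped to [0, total_steps]."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (stop - start) * fraction

nsteps = 10 ** 6
lr = 0.0001

for step in (0, nsteps // 2, nsteps):
    print(step, linear_lr(step, nsteps, lr, 0))
```

The learning rate starts at `lr`, is halved after half of the training steps and reaches zero at the end of training.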

In [None]:
if not os.path.exists(outdir):
    os.makedirs(outdir)

Before you run the next cell, double-check all the previous hyperparameters and make sure they are compatible with your system (e.g. the number of threads).

Furthermore, if you want to know the number of models found during training, the training loop in ChainerRL has to be changed according to the [readme](https://github.com/robin-schneider/gymCICY).

Training will take some time.

In [None]:
training = experiments.train_agent_async(
        agent=agent,
        outdir=outdir,
        processes=threads,
        make_env=make_env,
        steps=nsteps,
        eval_interval=eval_interval,
        eval_n_episodes=eval_n_runs,
        max_episode_len=max_episode_len,
        global_step_hooks=[lr_decay_hook])

## Plotting the Results

Finally, we can load the scores and see how our agents performed.

In [None]:
import pandas as pd
import json
import matplotlib.pyplot as plt

In [None]:
scores = []
models = []

We parse the results folder.

In [None]:
for root, dirs, files in os.walk(outdir):
    if 'scores.txt' in files:
        scores += [pd.read_csv(os.path.join(root, 'scores.txt'), delimiter='\t')]
    if 'model_info' in files:
        tmp_models = []
        with open(os.path.join(root, 'model_info')) as f:
            # read the first line of the file; it holds the JSON-encoded model info
            fline = f.readline()
            tmp_models += [json.loads(fline)]
        models += [tmp_models]

Finally, we make some plots. Here are the data collected in the evaluation runs.

In [None]:
scores[0]

In [None]:
print(scores[0].plot(x='steps', y='mean'))
print(scores[0].plot(x='steps', y='average_value'))
print(scores[0].plot(x='steps', y='average_entropy'))

More interestingly, we plot the models found. This step assumes that the ChainerRL training loop has been changed previously according to the [readme](https://github.com/robin-schneider/gymCICY).

In [None]:
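if len(models) != 0:
    # number of models found so far vs. the time step at which each was found
    nmodels = np.arange(len(models[0]))
    time_steps = np.array([m['gt'] for m in models[0]])
    plt.title('6777 - flipping')
    plt.xlabel('steps')
    plt.ylabel('# of models')
    plt.plot(time_steps, nmodels, label='models found')
    plt.legend(loc='best')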
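
The curve in the last cell is a cumulative count of models over training time. A minimal sketch of that bookkeeping in plain Python follows. It assumes, as in the cell above, that each model entry stores the time step at which it was found under the key `'gt'`. The entries in `dummy_models` are made-up placeholders and not results of a training run, and `cumulative_counts` is a hypothetical helper.

```python
def cumulative_counts(models):
    """Sort the discovery steps and count how many models exist at each of them."""
    steps = sorted(m['gt'] for m in models)
    counts = list(range(1, len(steps) + 1))
    return steps, counts

# Placeholder entries for illustration only (NOT real training output).
dummy_models = [{'gt': 1200}, {'gt': 300}, {'gt': 4500}]

steps, counts = cumulative_counts(dummy_models)
print(steps)   # [300, 1200, 4500]
print(counts)  # [1, 2, 3]
```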