In [None]:
# Let's import some useful libraries
import warnings
warnings.filterwarnings(action="ignore", category=FutureWarning)
%matplotlib notebook
%load_ext autoreload
%autoreload 2
from change_param import Param
import matplotlib.pyplot as plt
import gym
import numpy as np
import param
from gym_film.envs import make_env
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common.vec_env import DummyVecEnv
p = Param()

# Exploiting translational invariance with a CNN architecture

As we said earlier, our policy should not be very different whether we control $10$ or $11$ jets.

That's because of the **translational invariance** of the simulation. 

Indeed, the physics of our simulation are the same at $x = 50$ and at $x = 200$, which means that **our jets should have more or less the same behavior wherever they are.**
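
To make this concrete, here is a toy sketch (not the simulation itself). It assumes a hypothetical 1D film-height profile with the same bump placed at $x = 50$ and at $x = 200$; the `local_window` helper and the window half-width are made up for illustration. A jet that only looks at its neighbourhood sees the same thing at both positions, so a policy based on that local view has no reason to act differently.

```python
import numpy as np

def local_window(profile, center, half_width):
    """Return the part of the profile seen by a jet located at `center`."""
    return profile[center - half_width:center + half_width + 1]

# Hypothetical film-height profile: a flat film with two identical bumps.
bump = np.exp(-0.5 * (np.arange(-10, 11) / 3.0) ** 2)
profile = np.ones(300)
profile[50 - 10:50 + 11] += bump
profile[200 - 10:200 + 11] += bump

# Local observations of two jets, one at x = 50 and one at x = 200.
obs_at_50 = local_window(profile, 50, half_width=10)
obs_at_200 = local_window(profile, 200, half_width=10)

# The two jets see the same local picture despite being far apart.
same_local_view = np.allclose(obs_at_50, obs_at_200)
print(same_local_view)
```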

Remember how our agent uses a neural network to transform an **observation** into an **action**?

We used to concatenate the observations of all the jets into one single observation. What we can do instead is apply the same transformation to each jet's observation, so that our policy is the same for every jet.

A simplified view of such a model is shown in this figure:

![graph2](img/method2.png)

### A convolutional trick

Using a **Convolutional Neural Network** lets us apply the exact same transformation to each row of our input. This means that if **each** of our jets had the **same observation**, they would all be given the **same action** on the next step (not exactly the same, actually, because we use a stochastic policy, but the output of our neural network would be the same for each jet).
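
Here is a minimal NumPy sketch of this weight sharing. It assumes a tiny two-layer network with made-up sizes and random weights; it is not the `CustomPolicy` used below, only an illustration of the idea. The same weights are applied to every row of the observation matrix, one row per jet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not the ones used in this notebook.
obs_size, hidden_size = 10, 16

# One single set of weights, shared by all the jets.
W1 = rng.normal(size=(obs_size, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(size=(hidden_size, 1))
b2 = np.zeros(1)

def shared_policy(observations):
    """Map an (n_jets, obs_size) array to one network output per jet."""
    hidden = np.tanh(observations @ W1 + b1)
    return (hidden @ W2 + b2).ravel()

# Five jets that all see the same observation get the same output.
single_obs = rng.normal(size=obs_size)
same_obs = np.tile(single_obs, (5, 1))
actions_5 = shared_policy(same_obs)

# The very same weights can also be evaluated on 11 jets.
actions_11 = shared_policy(rng.normal(size=(11, obs_size)))
```

Because the weights do not depend on the number of rows, nothing in this sketch needs to change when going from 5 jets to 10 or 11.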

### Let's train an agent with $5$ jets with this new trick
We will use the same parameters as in the previous, failed training.

Training runs for $40\,000$ environment steps and should take around $10$ minutes. Go grab a coffee or something.

In [None]:
position_first_jet = 170
size_obs_to_reward = 10
n_jets = 5
n_cpu = 1
# maximum jet power
JET_MAX_POWER = 5.0
p.update_dic({'n_jets': n_jets,
              'position_first_jet': position_first_jet,
              'size_obs_to_reward': size_obs_to_reward,
              'n_cpu': n_cpu,
              'JET_MAX_POWER': JET_MAX_POWER})

In [None]:
from gym_film.envs import make_env
from stable_baselines import PPO2
from gym_film.model.custom_shared_mlp import CustomPolicy
policy = CustomPolicy

envs = make_env.make_env('1env_njet', param.n_jets, param.jets_position, render=False)
env = DummyVecEnv(envs)
obs = env.reset()

model = PPO2(policy, env=env, n_steps=param.nb_timestep_per_simulation, verbose=1)

# Let's train it for 40 000 environment steps
n_step_training = 400 * 100  # 100 episodes of 400 steps each
model.learn(n_step_training)

And let's render it:

In [None]:
from gym_film.envs import make_env
envs = make_env.make_env('1env_njet', param.n_jets, param.jets_position, render=True, plot_jets=True)
env = DummyVecEnv(envs)
obs = env.reset()

# Duration of the rendering (simulated time).
# You can increase it to see how the control adapts to the big waves created by a perturbation jet.
time_simulation = 20
render_total_timesteps = int(time_simulation / param.simulation_step_time)

# Set to False to send a zero action to every jet instead of using the agent
use_agent = True
for i in range(render_total_timesteps):
    if use_agent:
        action, _states = model.predict(obs)
    else:
        action = [np.zeros(param.n_jets)]
    obs, rewards, done, info = env.step(action)

Better than before, right? (Or not? The learning process is not deterministic, so the result can still be pretty bad.)

Remember what we said earlier about **using a single reward for all the jets**: we are still doing that here.

But we can fix that.

### Let's go to the [next and final notebook](Method M3.ipynb)