# Tutorial ACS_UPB_LAB1: Running Sumo Simulations

__Credits: most of the credit for this notebook goes to https://github.com/flow-project/flow/tree/master/tutorials__

This tutorial walks through the process of running non-RL traffic simulations in Flow. Simulations of this form act as non-autonomous baselines and depict the behavior of human dynamics on a network. Similar simulations may also be used to evaluate the performance of hand-designed controllers on a network. This tutorial focuses primarily on the former use case, while an example of the latter may be found in `exercise07_controllers.ipynb`.

In this exercise, we simulate an initially perturbed single-lane ring road. We witness in simulation that, as time advances, the initial perturbations do not dissipate but instead propagate and expand until vehicles are forced to periodically stop and accelerate. For more information on this behavior, we refer the reader to [1].

## 1.1 Components of a Simulation
All simulations, both in the presence and absence of RL, require two components: a *network* and an *environment*. Networks describe the features of the transportation network used in simulation. This includes the positions and properties of the nodes and edges constituting the lanes and junctions, as well as properties of the vehicles, traffic lights, inflows, etc. in the network. Environments, on the other hand, initialize, reset, and advance simulations, and act as the primary interface between the reinforcement learning algorithm and the network. Moreover, custom environments may be used to modify the dynamical features of a network.

## 1.2 Setting up the environment of current lab (ENV1)
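Load the configuration for the current lab. The cells below read every lab-specific setting from this dictionary, imported as `ENV`; the outputs shown in this notebook come from a run with `ENV2`.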

## 2. Setting up a Network
Flow contains a plethora of pre-designed networks used to replicate highways, intersections, and merges in both closed and open settings. All these networks are located in `flow/networks`. To recreate a ring road network, one would import `RingNetwork`; in this lab, the network class is instead read from the lab configuration (`ENV["NETWORK"]`), which resolves to `FigureEightNetwork` in the run shown below.

In [19]:
from flow.envs.nemodrive_lab import ENV2 as ENV

# from flow.networks.figure_eight import FigureEightNetwork
network_name = ENV["NETWORK"]
print(network_name.__name__)

FigureEightNetwork


This network, as well as all other networks in Flow, is parametrized by the following arguments: 
* name
* vehicles
* net_params
* initial_config
* traffic_lights

These parameters allow a single network to be recycled for a multitude of different network settings. For example, `RingNetwork` may be used to create ring roads of variable length with a variable number of lanes and vehicles.

### 2.1 Name
The `name` argument is a string variable depicting the name of the network. This has no effect on the type of network created.

In [20]:
name = network_name.__name__

### 2.2 VehicleParams
The `VehicleParams` class stores state information on all vehicles in the network. This class is used to identify the dynamical behavior of a vehicle and whether it is controlled by a reinforcement learning agent. Moreover, information pertaining to the observations and reward function can be collected from various get methods within this class.

The initial configuration of this class describes the number of vehicles in the network at the start of every simulation, as well as the properties of these vehicles. We begin by creating an empty `VehicleParams` object.

In [21]:
vehicles = ENV["VEHICLES"]()

# code in get_vehicles 
# from flow.core.params import VehicleParams

# vehicles = VehicleParams()

Once this object is created, vehicles may be introduced using the `add` method. This method specifies the types and quantities of vehicles at the start of a simulation rollout. For a description of the various arguments associated with the `add` method, we refer the reader to the following documentation ([VehicleParams.add](https://flow.readthedocs.io/en/latest/flow.core.html?highlight=vehicleparam#flow.core.params.VehicleParams)).

When adding vehicles, their dynamical behaviors may be specified either by the simulator (default), or by user-generated models. For longitudinal (acceleration) dynamics, several prominent car-following models are implemented in Flow. For this example, the acceleration behavior of all vehicles will be defined by the Intelligent Driver Model (IDM) [2].
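To make the car-following idea concrete, the following is a minimal standalone sketch of the IDM acceleration law from [2]. It is not Flow's `IDMController`; the function name and the default parameter values (`v0`, `T`, `a`, `b`, `delta`, `s0`) are illustrative assumptions.

```python
import math


def idm_acceleration(v, lead_v, gap, v0=30.0, T=1.0, a=1.0, b=1.5, delta=4, s0=2.0):
    """Acceleration of a follower under the Intelligent Driver Model.

    v       -- follower speed (m/s)
    lead_v  -- leader speed (m/s)
    gap     -- bumper-to-bumper distance to the leader (m), must be > 0
    The remaining arguments are illustrative parameter values (assumed).
    """
    # Desired gap: minimum gap + time-headway term + braking term.
    s_star = s0 + max(0.0, v * T + v * (v - lead_v) / (2.0 * math.sqrt(a * b)))
    # Free-road term minus interaction term.
    return a * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


# Standing vehicle on an almost empty road: acceleration is close to `a`.
print(idm_acceleration(v=0.0, lead_v=0.0, gap=1e9))
# Tailgating at 10 m/s with a 2 m gap: strong braking (negative value).
print(idm_acceleration(v=10.0, lead_v=10.0, gap=2.0))
```

The interaction term grows quadratically as the gap shrinks below the desired gap, which is what produces the stop-and-go waves described in the introduction.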

In [22]:
# code in get_vehicles 
# from flow.controllers.car_following_models import IDMController

Another controller we define is for the vehicle's routing behavior. For closed networks, where the route of any vehicle is repeated, the `ContinuousRouter` controller is used to perpetually reroute all vehicles to the initial set route.

In [23]:
# code in get_vehicles 
# from flow.controllers.routing_controllers import ContinuousRouter

Finally, we add 22 vehicles of type "human" with the above acceleration and routing behavior to the `VehicleParams` object.

In [24]:
# (E.g. code in get_vehicles)
# vehicles.add("human",
#              acceleration_controller=(IDMController, {}),
#              routing_controller=(ContinuousRouter, {}),
#              num_vehicles=22)

### 2.3 NetParams

`NetParams` are network-specific parameters used to define the shape and properties of a network. Unlike most other parameters, `NetParams` may vary drastically depending on the specific network configuration, and accordingly most of its parameters are stored in `additional_params`. In order to determine which `additional_params` variables may be needed for a specific network, we refer to the `ADDITIONAL_NET_PARAMS` variable located in the network file.

In [25]:
# from flow.networks.ring import ADDITIONAL_NET_PARAMS

ADDITIONAL_NET_PARAMS = ENV["ADDITIONAL_NET_PARAMS"]

print(ADDITIONAL_NET_PARAMS)

{'radius_ring': 60, 'lanes': 2, 'speed_limit': 30, 'resolution': 40}


Reading the `ADDITIONAL_NET_PARAMS` dict from the lab configuration, we see that the required parameters are:

* **radius_ring**: radius of the circular portions of the network
* **lanes**: number of lanes
* **speed_limit**: speed limit for all edges
* **resolution**: resolution of the curved portions of the network (for a ring road, setting this value to 1 converts the ring to a diamond)
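As a small illustration of how such a defaults dictionary is typically used, the sketch below copies the printed values, overrides one entry, and checks a candidate dictionary for missing keys. The helper `missing_net_params` is hypothetical and is not part of Flow.

```python
# Defaults copied from the dictionary printed above.
ADDITIONAL_NET_PARAMS = {"radius_ring": 60, "lanes": 2, "speed_limit": 30, "resolution": 40}


def missing_net_params(additional_params, required=ADDITIONAL_NET_PARAMS):
    """Return the required keys that are absent (hypothetical helper)."""
    return sorted(key for key in required if key not in additional_params)


# Copy the defaults before overriding, so the shared dictionary is not mutated.
custom_params = dict(ADDITIONAL_NET_PARAMS, lanes=1)

print(custom_params["lanes"], ADDITIONAL_NET_PARAMS["lanes"])
print(missing_net_params({"lanes": 1}))
```

Copying first matters because the same dictionary object may be shared by every cell that reads the configuration.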


At times, other inputs may be needed from `NetParams` to recreate proper network features/behavior. These requirements can be found in the network's documentation. For the ring road, no attributes are needed aside from the `additional_params` terms. Furthermore, for this exercise, we use the parameters from the lab configuration unchanged when creating the `NetParams` object.

In [26]:
from flow.core.params import NetParams

net_params = NetParams(additional_params=ADDITIONAL_NET_PARAMS)
print(net_params)

<flow.core.params.NetParams object at 0x7ff3f42c76d8>


### 2.4 InitialConfig

`InitialConfig` specifies parameters that affect the positioning of vehicles in the network at the start of a simulation. These parameters can be used to limit the edges and number of lanes vehicles originally occupy, and provide a means of adding randomness to the starting positions of vehicles. To introduce an initial disturbance to the system of vehicles in the network, the lab configuration sets `spacing` to `'random'` and the `perturbation` term in `InitialConfig` to 50, as the output below shows.
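The role of a perturbation term can be pictured with a small standalone sketch: vehicles are first spaced evenly along a closed loop and then shifted by random noise. This is an illustration only, not Flow's placement code; the function name, the loop length of 230 m, and the zero-mean Gaussian noise are assumptions.

```python
import random


def perturbed_positions(num_vehicles, length, perturbation, seed=0):
    """Evenly spaced start positions on a loop, each shifted by Gaussian noise.

    Illustrative sketch; `perturbation` is used as the noise standard deviation.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    spacing = length / num_vehicles
    positions = []
    for i in range(num_vehicles):
        noise = rng.gauss(0.0, perturbation)
        # Wrap around so every position stays on the closed loop.
        positions.append((i * spacing + noise) % length)
    return positions


positions = perturbed_positions(num_vehicles=22, length=230.0, perturbation=1.0)
print(len(positions), min(positions) >= 0.0, max(positions) < 230.0)
```

With a perturbation of 0 the vehicles would start perfectly equidistant; larger values produce uneven gaps from the first step.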

In [27]:
from flow.core.params import InitialConfig
initial_config_param = ENV["INITIAL_CONFIG_PARAMS"]
print(initial_config_param)

initial_config = InitialConfig(**initial_config_param)

{'spacing': 'random', 'perturbation': 50}


### 2.5 TrafficLightParams

`TrafficLightParams` are used to describe the positions and types of traffic lights in the network. These inputs are outside the scope of this tutorial, and instead are covered in `exercise06_traffic_lights.ipynb`. For our example, we create an empty `TrafficLightParams` object, thereby ensuring that none are placed on any nodes.

In [28]:
from flow.core.params import TrafficLightParams

traffic_lights = TrafficLightParams()

## 3. Setting up an Environment

Several environments in Flow exist to train autonomous agents of different forms (e.g. autonomous vehicles, traffic lights) to perform a variety of different tasks. These environments are often network- or task-specific; however, some can be deployed on an arbitrary set of networks as well. One such environment, `AccelEnv`, may be used to train a variable number of vehicles in a fully observable network with a *static* number of vehicles. In this lab, the environment class is read from the lab configuration.

In [29]:
# from flow.envs.nemodrive_lab.env1_lab import LaneChangeAccelEnv1
env_name = ENV["ENVIRONMENT"]
print(env_name)

<class 'flow.envs.nemodrive_lab.env2_lab.LaneChangeAccelEnv2'>


Although we will not be training any autonomous agents in this exercise, the use of an environment allows us to view the cumulative reward simulation rollouts receive in the absence of autonomy.

Environments in Flow are parametrized by three components:
* `EnvParams`
* `SumoParams`
* `Network`

### 3.1 SumoParams
`SumoParams` specifies simulation-specific variables. These variables include the length of a simulation step (in seconds) and whether to render the GUI when running the experiment. For this example, we consider a simulation step length of 0.1s and activate the GUI.

Another useful parameter is `emission_path`, which specifies the path where the emission output is generated. This output contains a lot of information about the simulation, for instance the position and speed of each car at each time step. If no emission path is specified, the emission file is not generated.
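Once an emission file has been converted to CSV, it can be post-processed with the standard library. The sketch below averages the speed of all vehicles per time step. The column names `time`, `id`, and `speed` and the sample rows are assumptions made for illustration; check the header of the generated file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the assumed layout (not real simulation output).
SAMPLE = """time,id,speed
0.0,human_0,0.0
0.0,human_1,1.0
0.1,human_0,0.5
0.1,human_1,1.5
"""


def mean_speed_per_step(csv_file):
    """Map each time stamp to the mean speed of all vehicles at that time."""
    totals = defaultdict(lambda: [0.0, 0])  # time -> [speed sum, vehicle count]
    for row in csv.DictReader(csv_file):
        entry = totals[float(row["time"])]
        entry[0] += float(row["speed"])
        entry[1] += 1
    return {t: total / count for t, (total, count) in sorted(totals.items())}


print(mean_speed_per_step(io.StringIO(SAMPLE)))
```

For a real run, replace `io.StringIO(SAMPLE)` with an open file handle on the generated CSV.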

In [30]:
from flow.core.params import SumoParams

sumo_params = SumoParams(sim_step=0.1, render=True, emission_path='data', restart_instance=True)

### 3.2 EnvParams

`EnvParams` specify environment and experiment-specific parameters that either affect the training process or the dynamics of various components within the network. Much like `NetParams`, the attributes associated with this parameter are mostly environment specific, and can be found in the environment's `ADDITIONAL_ENV_PARAMS` dictionary.

In [31]:
# from flow.envs.nemodrive_lab.env1_lab import ADDITIONAL_ENV1_PARAMS
ADDITIONAL_ENV_PARAMS = ENV["ADDITIONAL_ENV_PARAMS"]

print(ADDITIONAL_ENV_PARAMS)

{'max_accel': 3, 'max_decel': 3, 'lane_change_duration': 0, 'target_velocity': 10, 'sort_vehicles': False, 'forward_progress_gain': 0.1, 'collision_reward': -1, 'lane_change_reward': -0.1, 'frontal_collision_distance': 2.0, 'lateral_collision_distance': 3.0, 'action_space_box': False, 'pos_noise_std': [0.5, 2], 'pos_noise_steps_reset': 100, 'speed_noise_std': [0.2, 0.8], 'acc_noise_std': [0.2, 0.4]}


Printing the `ADDITIONAL_ENV_PARAMS` variable, we see that it contains, among other entries, "target_velocity", which is used when computing the reward function associated with the environment, along with acceleration bounds, collision and lane-change reward terms, and noise settings. We pass this dictionary unchanged when generating the `EnvParams` object.
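To illustrate how a target velocity can enter a reward, here is a minimal sketch of a normalized "desired velocity" reward: it equals 1 when every vehicle drives at the target speed and falls to 0 as speeds deviate. This is one plausible form only. The reward of the lab environment also involves terms such as `collision_reward` and `lane_change_reward`, and their combination is not shown in this notebook; the function name is hypothetical.

```python
import math


def desired_velocity_reward(speeds, target_velocity):
    """Normalized reward in [0, 1] based on the distance to the target speed."""
    if not speeds:
        return 0.0
    # Largest cost considered: every vehicle standing still.
    max_cost = math.sqrt(len(speeds)) * abs(target_velocity)
    if max_cost == 0:
        return 0.0
    # Euclidean distance between actual speeds and the target speed.
    cost = math.sqrt(sum((v - target_velocity) ** 2 for v in speeds))
    return max(max_cost - cost, 0.0) / max_cost


print(desired_velocity_reward([10.0, 10.0, 10.0], target_velocity=10))
print(desired_velocity_reward([5.0, 5.0], target_velocity=10))
```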

In [32]:
from flow.core.params import EnvParams

env_params = EnvParams(additional_params=ADDITIONAL_ENV_PARAMS, horizon=ENV["HORIZON"])

## 4. Setting up and Running the Experiment
Once the inputs to the network and environment classes are ready, we can set up an `Experiment` object.

In [33]:
from flow.core.experiment import Experiment

These objects may be used to simulate rollouts in the absence of reinforcement learning agents, as well as acquire behaviors and rewards that may be used as a baseline with which to compare the performance of the learning agent. In this case, we choose to run our experiment for one rollout consisting of 3000 steps (300 s).
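The relation between the number of steps and the simulated time follows directly from the simulation step length. A small sketch (the helper names are hypothetical):

```python
def rollout_duration(num_steps, sim_step):
    """Simulated time in seconds covered by `num_steps` simulation steps."""
    return num_steps * sim_step


def steps_for_duration(duration, sim_step):
    """Number of simulation steps needed to cover `duration` seconds."""
    return round(duration / sim_step)


print(rollout_duration(3000, 0.1))
print(steps_for_duration(300, 0.1))
```

Halving `sim_step` therefore doubles the number of steps required for the same simulated time.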

**Note**: When executing the code below, remember to click on the <img style="display:inline;" src="img/play_button.png"> Play button after the GUI is rendered.

In [34]:
# create the network object
network = network_name(name="ring_example",
                       vehicles=vehicles,
                       net_params=net_params,
                       initial_config=initial_config,
                       traffic_lights=traffic_lights)



In [17]:
# create the environment object
sumo_params.render = True
env = env_name(env_params, sumo_params, network)

# create the experiment object
exp = Experiment(env)
_ = exp.run(1, 3000, convert_to_csv=True)


Error during start: Traceback (most recent call last):
  File "/home/victor/Documents/AAIT/lab1/flow/flow/core/kernel/simulation/traci.py", line 158, in start_simulation
    traci_connection.setOrder(0)
  File "/home/victor/anaconda3/envs/flow/lib/python3.6/site-packages/traci/connection.py", line 348, in setOrder
    self._sendExact()
  File "/home/victor/anaconda3/envs/flow/lib/python3.6/site-packages/traci/connection.py", line 99, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO

KeyboardInterrupt: 

Run a still agent, which applies the constant action `[0, 0]` at every step.

In [None]:
sumo_params.render = True
env = env_name(env_params, sumo_params, network)

# create the experiment object
exp = Experiment(env)

rl_actions = lambda state: [0, 0]

_ = exp.run(1, 3000, convert_to_csv=True, rl_actions=rl_actions)

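Run a simple rule-based agent (named `RandomAgent` in the code below): it always applies the maximum acceleration and requests a lane change whenever its speed has dropped since the previous step.

Use __FullExperiment__ to test an agent whose `act` method expects _state, reward, done, info_.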

In [17]:
from flow.core.experiment_with_reward import FullExperiment
import numpy as np

class RandomAgent():
    def __init__(self, env):
        self.action_space = env.action_space
        self.max_decel = env.env_params.additional_params["max_decel"]
        self.max_accel = env.env_params.additional_params["max_accel"]
        self.max_speed = env.net_params.additional_params['speed_limit']
        self.change_lane_step_freq = 1
        self.num_steps = 0
        self.prev_speed = None
        
    def split_state(self, state):
        return np.split(state, 3)
    
    def change_lane(self, current_lane):
        if current_lane > 0.25:
            return 0
        else:
            return 2
        
    def act(self, state, reward, done, info):
        speed, pos, lane = self.split_state(state)
        current_speed = speed[0] * self.max_speed
        d = 1
        if self.prev_speed is not None and self.prev_speed > current_speed:
            d = self.change_lane(lane[0])
        self.prev_speed = current_speed
#         print(f'State speed {speed[0]:.5f}, pos {pos[0]:.2f}, lane {lane[0]:.2f}, reward {reward:.2f}')
        self.num_steps += 1

        acc = self.max_accel
        action =  np.array([acc, d])
#         print(f'Action acc {acc:.5f}, lane change {d}')
        yield action

sumo_params.render = False
env = env_name(env_params, sumo_params, network)

exp = FullExperiment(env)

agent = RandomAgent(env)

_ = exp.run(10, 3000, convert_to_csv=True, rl_actions=agent.act)

Round 0, return: 2045.934041388867
Round 1, return: 3040.3029170426053
Round 2, return: 3942.550049239418
Round 3, return: 3474.2417063040443
Round 4, return: 4647.296146049075
Round 5, return: 3996.9543100739525
Round 6, return: 4382.0542963642265
Round 7, return: 2022.706060773645
Round 8, return: 4056.83268915989
Round 9, return: 4185.936077234212
Average, std return: 3579.480829362993, 881.0908793593906
Average, std speed: 6.257771848130913, 0.6050931947327768


In [18]:
'''
Best strategy so far: maximum acceleration, changing lane when the speed has
decreased since the last timestep.
PID variant: reset the PID history when the speed decreases and change lane
(try to reach the maximum velocity again).

Returns, given as mean, std:

Max accel: 2500, 600
PID:       2200, 400

With steering
Max accel: 3600, 880
PID:       2600, 100

Final
Max accel: 3495.8623019177276, 755.1837757284029
PID:       3823.135477809101, 503.36998899210744

PID second run: results are printed in the output of this cell.
'''


from flow.core.experiment_with_reward import FullExperiment
import numpy as np

KP = 10
KI = 0.3
KD = 100
PAST_K = 30

class PIDAgent():
    def __init__(self, env):
        self.action_space = env.action_space
        self.max_decel = env.env_params.additional_params["max_decel"]
        self.max_accel = env.env_params.additional_params["max_accel"]
#         self.target_velocity = env.env_params.additional_params["target_velocity"]
        self.target_velocity = env.net_params.additional_params['speed_limit']

#         print(env.net_params.additional_params)
#         print(help(env.network))
        self.max_speed = env.net_params.additional_params['speed_limit']
        self.current_target_velocity = self.target_velocity
        self.lanes = env.net_params.additional_params['lanes']
        self.past_delta_velocities = []
        self.Kp = KP
        self.Ki = KI
        self.Kd = KD
        self.past_deltas = PAST_K
        self.num_steps = 0
        self.prev_speed = None
        
    def split_state(self, state):
        return np.split(state, 3)
    
    def change_lane(self, current_lane):
        if current_lane == 1:
            return 0
        else:
            return 2
    
    def other_lane(self, current_lane):
        if current_lane == 1:
            return 0
        else:
            return 1
        
    def nearby_vehicles_on_lane(self, target_lane, current_position, other_positions, other_lanes, nearby = 0.05):
        for i in range(1, len(other_positions)):
             if other_lanes[i] == target_lane:
                if np.abs(other_positions[i] - current_position) < nearby:
                    return True
        return False
    
    def update_current_target_velocity(self, new_target):
        if not np.allclose(new_target, self.current_target_velocity):
            self.past_delta_velocities = []
        self.current_target_velocity = new_target
        
    def act(self, state, reward, done, info):
        self.num_steps += 1
        speed, pos, lane = self.split_state(state)
        current_speed = speed[0] * self.max_speed
        lane = (np.round(self.lanes * lane)).astype(int)  # np.int was removed in NumPy 1.24
        current_lane = lane[0]
        current_position = pos[0]
        other_lane = self.other_lane(current_lane)
        d = 1
        if self.prev_speed is not None and self.prev_speed > current_speed:
            if self.nearby_vehicles_on_lane(current_lane, current_position, pos[1:], lane[1:], 0.025):
                if not self.nearby_vehicles_on_lane(other_lane, current_position, pos[1:], lane[1:], 0.025):
                    d = self.change_lane(lane[0])
                    self.update_current_target_velocity(self.target_velocity)
                else:
                    self.update_current_target_velocity(0)
            else:
                self.update_current_target_velocity(self.target_velocity)
        else:
            self.update_current_target_velocity(self.target_velocity)
            
        self.prev_speed = current_speed
        
        diff = self.current_target_velocity - current_speed
        self.past_delta_velocities.append(diff)
        self.past_delta_velocities = self.past_delta_velocities[- self.past_deltas:]
        p = diff
        i = sum(self.past_delta_velocities)
        der = 0
        if len(self.past_delta_velocities) > 2:
            der = self.past_delta_velocities[-1] - self.past_delta_velocities[-2]
        acc = self.Kp * p + self.Ki * i + self.Kd * der
#         print(f"{current_speed:.4f}/{self.current_target_velocity:.4f}, {p:.4f}, {len(self.past_delta_velocities)}, {i:.4f}, {der:.4f}, {reward:.4f}")
#         print(len(self.past_delta_velocities), self.past_delta_velocities[-1])
        acc = max(min(acc, self.max_accel), - self.max_decel)
#         acc = self.max_accel
        action =  np.array([acc, d])
        yield action

sumo_params.render = False
env = env_name(env_params, sumo_params, network)

exp = FullExperiment(env)

agent = PIDAgent(env)

_ = exp.run(10, 3000, convert_to_csv=True, rl_actions=agent.act)

Round 0, return: 3740.6555759515295
Round 1, return: 4115.766751729745
Round 2, return: 2833.4711519168195
Round 3, return: 2832.8820155768735
Round 4, return: 3803.7717450343916
Round 5, return: 3761.857234484111
Round 6, return: 3727.966912218524
Round 7, return: 3809.5319302063267
Round 8, return: 3660.0931808848077
Round 9, return: 2368.4143970948403
Average, std return: 3465.4410895097963, 541.2198206627497
Average, std speed: 6.517645961584906, 0.3936315772803137


In [36]:
# ENV 2: rerun the PID agent on the second lab environment.
# PIDAgent, KP, KI, KD, PAST_K and the imports are reused unchanged from the previous cell.

sumo_params.render = False
env = env_name(env_params, sumo_params, network)

exp = FullExperiment(env)

agent = PIDAgent(env)

_ = exp.run(10, 3000, convert_to_csv=True, rl_actions=agent.act)

Round 0, return: 2465.5635179896326
Round 1, return: 2804.5172637262854
Round 2, return: 2926.9330722507043
Round 3, return: 2636.5917181581663
Round 4, return: 2834.5662420127655
Round 5, return: 2938.824138425596
Round 6, return: 2139.17540433827
Round 7, return: 2124.280556513759
Round 8, return: 2086.321631063388
Round 9, return: 2554.2001131978577
Average, std return: 2551.0973657676423, 319.1652396442611
Average, std speed: 7.764650203414684, 0.17084722447599657



Feel free to experiment with all these problems and more!
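As a starting point for such experiments, the speed-tracking logic of the PID agent above can be isolated into a small standalone controller and tuned without launching SUMO. The sketch below reuses the gains and window size from the notebook (`KP = 10`, `KI = 0.3`, `KD = 100`, `PAST_K = 30`), but the class name `SpeedPID` is hypothetical. It is not a drop-in replacement for `PIDAgent`: it computes the derivative as soon as two error samples exist, and it leaves lane changes out entirely.

```python
from collections import deque


class SpeedPID:
    """PID controller mapping a speed error to a clipped acceleration."""

    def __init__(self, kp=10.0, ki=0.3, kd=100.0, window=30,
                 max_accel=3.0, max_decel=3.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.max_accel, self.max_decel = max_accel, max_decel
        # Only the most recent `window` errors feed the integral term.
        self.errors = deque(maxlen=window)

    def reset(self):
        """Forget past errors, e.g. after the target speed changes."""
        self.errors.clear()

    def step(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.errors.append(error)
        integral = sum(self.errors)
        derivative = self.errors[-1] - self.errors[-2] if len(self.errors) >= 2 else 0.0
        raw = self.kp * error + self.ki * integral + self.kd * derivative
        # Clip to the acceleration bounds of the environment.
        return max(min(raw, self.max_accel), -self.max_decel)


pid = SpeedPID()
print(pid.step(target_speed=30.0, current_speed=25.0))  # large error: saturates
pid.reset()
print(pid.step(target_speed=10.0, current_speed=9.9))   # small error: unsaturated
```

With gains this large relative to the acceleration bound of 3, the output saturates for most errors, so the controller behaves close to a bang-bang rule; lowering `kp` and `kd` is one thing to try.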

## Bibliography
[1] Sugiyama, Yuki, et al. "Traffic jams without bottlenecks—experimental evidence for the physical mechanism of the formation of a jam." New journal of physics 10.3 (2008): 033001.

[2] Treiber, Martin, Ansgar Hennecke, and Dirk Helbing. "Congested traffic states in empirical observations and microscopic simulations." Physical review E 62.2 (2000): 1805.

## 5 Setting up Flow Parameters

RLlib experiments generate a `params.json` file for each experiment run. For these experiments, the parameters defining the Flow network and environment must be stored as well. As such, in this section we define the dictionary `flow_params`, which contains the variables required by the utility function `make_create_env`. `make_create_env` is a higher-order function which returns a function `create_env` that initializes a Gym environment corresponding to the Flow network specified.
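The higher-order pattern can be sketched without Flow or Gym. In the toy version below, `make_create_env_sketch` and `DummyEnv` are hypothetical stand-ins, and the name format `<EnvClass>-v<version>` is an assumption that appears consistent with the trial name in the training log further down; the real `make_create_env` does more work, such as building the network.

```python
def make_create_env_sketch(params, version=0):
    """Return (create_env, name), where create_env() builds the environment lazily."""
    env_cls = params["env_name"]
    gym_name = "{}-v{}".format(env_cls.__name__, version)

    def create_env(*_):
        # The closure captures `params`, so callers need no further arguments.
        return env_cls(params["env"], params["sim"], params["network"])

    return create_env, gym_name


class DummyEnv:
    """Stand-in for a Flow environment class (illustration only)."""

    def __init__(self, env_params, sim_params, network):
        self.args = (env_params, sim_params, network)


create_env, gym_name = make_create_env_sketch(
    {"env_name": DummyEnv, "env": "env_params", "sim": "sim_params", "network": "network"}
)
print(gym_name)
print(create_env().args)
```

Returning a factory instead of an instance lets each parallel worker construct its own environment.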

In [17]:
# Creating flow_params. Make sure the dictionary keys are as specified. 
sumo_params.render = False
sumo_params.print_warnings=False
flow_params = dict(
    # name of the experiment
    exp_tag=name,
    # name of the flow environment the experiment is running on
    env_name=env_name,
    # name of the network class the experiment uses
    network=network_name,
    # simulator that is used by the experiment
    simulator='traci',
    # sumo-related parameters (see flow.core.params.SumoParams)
    sim=sumo_params,
    # environment related parameters (see flow.core.params.EnvParams)
    env=env_params,
    # network-related parameters (see flow.core.params.NetParams and
    # the network's documentation or ADDITIONAL_NET_PARAMS component)
    net=net_params,
    # vehicles to be placed in the network at the start of a rollout 
    # (see flow.core.params.VehicleParams)
    veh=vehicles,
    # (optional) parameters affecting the positioning of vehicles upon 
    # initialization/reset (see flow.core.params.InitialConfig)
    initial=initial_config
)

## 4 Running RL experiments in Ray

### 4.1 Import 

First, we must import modules required to run experiments in Ray. The `json` package is required to store the Flow experiment parameters in the `params.json` file, as is `FlowParamsEncoder`. Ray-related imports are required: the PPO algorithm agent, `ray.tune`'s experiment runner, and environment helper methods `register_env` and `make_create_env`.

In [18]:
import json

import ray
try:
    from ray.rllib.agents.agent import get_agent_class
except ImportError:
    from ray.rllib.agents.registry import get_agent_class
from ray.tune import run_experiments
from ray.tune.registry import register_env

from flow.utils.registry import make_create_env
from flow.utils.rllib import FlowParamsEncoder


### 4.2 Initializing Ray
Here, we initialize Ray and experiment-based constant variables specifying parallelism in the experiment as well as experiment batch size in terms of number of rollouts.

In [19]:
# number of parallel workers
N_CPUS = 4
# number of rollouts per training iteration
N_ROLLOUTS = 20

ray.init(num_cpus=N_CPUS)

2019-12-09 11:59:07,343	INFO node.py:498 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-12-09_11-59-07_342522_493/logs.
2019-12-09 11:59:07,462	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:56251 to respond...
2019-12-09 11:59:07,588	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:28768 to respond...
2019-12-09 11:59:07,595	INFO services.py:809 -- Starting Redis shard with 3.34 GB max memory.
2019-12-09 11:59:07,626	INFO node.py:512 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-12-09_11-59-07_342522_493/logs.
2019-12-09 11:59:07,630	INFO services.py:1475 -- Starting the Plasma object store with 5.01 GB memory using /dev/shm.


{'node_ip_address': '192.168.0.101',
 'redis_address': '192.168.0.101:56251',
 'object_store_address': '/tmp/ray/session_2019-12-09_11-59-07_342522_493/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-12-09_11-59-07_342522_493/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2019-12-09_11-59-07_342522_493'}

### 4.3 Configuration and Setup
Here, we copy and modify the default configuration for the [PPO algorithm](https://arxiv.org/abs/1707.06347). The agent has the number of parallel workers specified, a batch size corresponding to `N_ROLLOUTS` rollouts (each of which has length `HORIZON` steps), a discount rate $\gamma$ of 0.999, two hidden layers of size 16, uses Generalized Advantage Estimation, $\lambda$ of 0.97, and other parameters as set below.
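For readers unfamiliar with Generalized Advantage Estimation, the sketch below shows how $\gamma$ and $\lambda$ enter the advantage computation for a single rollout. It is a plain-Python illustration of the standard recursion, not RLlib's implementation; the function name and the `last_value` bootstrap argument are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.999, lam=0.97, last_value=0.0):
    """Generalized Advantage Estimation for one rollout.

    rewards[t] -- reward received at step t
    values[t]  -- value estimate of the state at step t
    last_value -- value estimate of the state after the final step
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lam = 1 and zero values, advantages are the rewards-to-go.
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

Setting `lam` closer to 0 relies more on the value estimates (lower variance, more bias), while values closer to 1 rely more on the observed returns.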

Once `config` contains the desired parameters, a JSON string corresponding to the `flow_params` specified in Section 5 is generated. The `FlowParamsEncoder` maps objects to string representations so that the experiment can be reproduced later. That string representation is stored within the `env_config` section of the `config` dictionary. Later, `config` is written out to the file `params.json`.
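The idea behind such an encoder can be shown with the standard `json` module alone. The sketch below is not `FlowParamsEncoder`; `ParamsEncoderSketch`, `DummyParams`, and `DummyEnv` are hypothetical names, and the mapping rules (classes to their names, objects to their attribute dictionaries) are assumptions chosen for illustration.

```python
import json


class ParamsEncoderSketch(json.JSONEncoder):
    """Serialize classes by name and plain objects by their attributes."""

    def default(self, obj):
        if isinstance(obj, type):
            return obj.__name__        # a class becomes its name
        if hasattr(obj, "__dict__"):
            return vars(obj)           # an object becomes a dict of attributes
        return super().default(obj)    # anything else raises TypeError


class DummyParams:
    def __init__(self):
        self.sim_step = 0.1
        self.render = False


class DummyEnv:
    pass


text = json.dumps({"env_name": DummyEnv, "sim": DummyParams()},
                  cls=ParamsEncoderSketch, sort_keys=True)
print(text)
```

`json.dumps` calls `default` only for values it cannot serialize natively, so ordinary numbers, strings, and dictionaries pass through untouched.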

Next, we call `make_create_env` and pass in the `flow_params` to return a function we can use to register our Flow environment with Gym. 

In [20]:
# The algorithm or model to train. This may refer to the name of a built-in
# algorithm (e.g. RLlib's DQN or PPO), or a user-defined trainable function
# or class registered in the tune registry.
alg_run = "PPO"
HORIZON = 100

agent_cls = get_agent_class(alg_run)
config = agent_cls._default_config.copy()
config["num_workers"] = N_CPUS - 1  # number of parallel workers
config["train_batch_size"] = HORIZON * N_ROLLOUTS  # batch size
config["gamma"] = 0.999  # discount rate
config["model"].update({"fcnet_hiddens": [16, 16]})  # size of hidden layers in network
config["use_gae"] = True  # using generalized advantage estimation
config["lambda"] = 0.97  
config["sgd_minibatch_size"] = min(16 * 1024, config["train_batch_size"])  # SGD minibatch size
config["kl_target"] = 0.02  # target KL divergence
config["num_sgd_iter"] = 500  # number of SGD iterations
config["horizon"] = HORIZON  # rollout horizon

# save the flow params for replay
flow_json = json.dumps(flow_params, cls=FlowParamsEncoder, sort_keys=True,
                       indent=4)  # generating a string version of flow_params
config['env_config']['flow_params'] = flow_json  # adding the flow_params to config dict
config['env_config']['run'] = alg_run

# Call the utility function make_create_env to be able to 
# register the Flow env for this experiment
create_env, gym_name = make_create_env(params=flow_params, version=0)

# Register as rllib env with Gym
register_env(gym_name, create_env)

### 4.4 Running Experiments

Here, we use the `run_experiments` function from `ray.tune`. The function takes a dictionary with one key, a name corresponding to the experiment, and one value, itself a dictionary containing parameters for training.

In [21]:
trials = run_experiments({
    flow_params["exp_tag"]: {
        "run": alg_run,
        "env": gym_name,
        "config": {
            **config
        },
        "checkpoint_freq": 1,  # number of iterations between checkpoints
        "checkpoint_at_end": True,  # generate a checkpoint at the end
        "max_failures": 999,
        "stop": {  # stopping conditions
            "training_iteration": 500,  # number of iterations to stop after
        },
    },
})

2019-12-09 11:59:20,574	INFO trial_runner.py:176 -- Starting a new experiment.
2019-12-09 11:59:20,618	ERROR log_sync.py:34 -- Log sync requires cluster to be setup with `ray up`.


== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/1 GPUs
Memory usage on this node: 4.0/16.7 GB





== Status ==
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 0/1 GPUs
Memory usage on this node: 4.0/16.7 GB
Result logdir: /home/victor/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING

[2m[36m(pid=536)[0m   _np_qint8 = np.dtype([("qint8", np.int8, 1)])
[2m[36m(pid=536)[0m   _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
[2m[36m(pid=536)[0m   _np_qint16 = np.dtype([("qint16", np.int16, 1)])
[2m[36m(pid=536)[0m   _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
[2m[36m(pid=536)[0m   _np_qint32 = np.dtype([("qint32", np.int32, 1)])
[2m[36m(pid=536)[0m   np_resource = np.dtype([("resource", np.ubyte, 1)])
[2m[36m(pid=536)[0m 2019-12-09 11:59:24,479	INFO rollout_worker.py:319 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
[2m[36m(pid=536)[0m 2019-12-09 11:59:24.480926: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supp

2019-12-09 13:09:19,294	ERROR import_thread.py:89 -- ImportThread: Error 111 connecting to 192.168.0.101:56251. Connection refused.
2019-12-09 13:09:19,295	ERROR worker.py:1716 -- listen_error_messages_raylet: Error 111 connecting to 192.168.0.101:56251. Connection refused.
2019-12-09 13:09:19,295	ERROR worker.py:1616 -- print_logs: Error 111 connecting to 192.168.0.101:56251. Connection refused.


KeyboardInterrupt: 