# Reinforcement Learning

To train an RL agent, we need (i) an environment and (ii) a learning method. In this work, we define a foraging environment in which the agent's goal is to find as many targets as possible in a given time. We consider environments with non-destructive (replenishable) targets, which we implement by displacing the agent a distance $l_\textrm{c}$ from the center of each target it finds.
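The displacement that makes targets replenishable can be illustrated with a minimal sketch. It is not the `TargetEnv` implementation: the function `kick_agent`, the uniformly random kick direction, and the periodic boundaries are all assumptions made for illustration.

```python
import numpy as np

def kick_agent(target_pos, l_c, world_size, rng):
    """Place the agent at distance l_c from the target center (hypothetical helper)."""
    # Assumption: the kick direction is drawn uniformly at random.
    angle = rng.uniform(0.0, 2.0 * np.pi)
    new_pos = target_pos + l_c * np.array([np.cos(angle), np.sin(angle)])
    # Assumption: the world is a square with periodic boundaries.
    return new_pos % world_size

rng = np.random.default_rng(0)
target = np.array([99.5, 50.0])  # a target close to the world edge
agent_pos = kick_agent(target, l_c=1.0, world_size=100.0, rng=rng)
```

Because the target itself is left in place, the agent can find it again later, which is what makes it non-destructive.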

For the agent, we use Projective Simulation (PS) to model its decision-making process and learning method. However, other algorithms that work with stochastic policies can also be used.

First, we import the classes that define the environment (`TargetEnv`) and the forager dynamics together with its learning method (`Forager`).

In [None]:
import numpy as np

from projective_simulation.agents.foraging import Forager
from projective_simulation.envs.foraging import TargetEnv
from tqdm.notebook import tqdm

Note: the `Forager` class currently inherits its decision-making and learning methods from a PS agent. Other learning algorithms can be plugged in by changing this inheritance. The replacement must provide a decision-making method called `deliberate`, which takes a state as input, and a policy-update method called `learn`, which takes a reward as input.
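As a rough illustration of that interface, the sketch below defines a stand-in learner exposing `deliberate` and `learn`. The class name `RandomPolicyLearner`, its constructor arguments, and the assumption that a state is an integer index are hypothetical; they are not part of `projective_simulation`.

```python
import numpy as np

class RandomPolicyLearner:
    """Hypothetical stand-in exposing the two methods a learner must provide."""

    def __init__(self, num_actions, num_states, seed=0):
        # Stochastic policy: one column of action probabilities per state.
        self.policy = np.full((num_actions, num_states), 1.0 / num_actions)
        self.rng = np.random.default_rng(seed)
        self.total_reward = 0.0

    def deliberate(self, state):
        """Sample an action from the policy for the given state index."""
        num_actions = self.policy.shape[0]
        return int(self.rng.choice(num_actions, p=self.policy[:, state]))

    def learn(self, reward):
        """Placeholder update: only accumulates the reward.

        A real algorithm would modify self.policy here.
        """
        self.total_reward += reward

learner = RandomPolicyLearner(num_actions=2, num_states=200)
action = learner.deliberate(0)
learner.learn(1.0)
```

Any class with these two methods and a stochastic policy could, in principle, take the place of the PS agent in the inheritance described above.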

We set the episode length (number of RL steps per episode) and the number of episodes.

In [None]:
TIME_EP = 200 #time steps per episode
EPISODES = 1200 #number of episodes

We initialize the environment.

In [None]:
#Environment parameters
Nt = 100 #number of targets
L = 100 #world size
r = 0.5 #target detection radius
lc = np.array([[1.0],[1]]) #cutoff length

#Initialize environment
env = TargetEnv(Nt, L, r, lc)

We initialize the agent. As its state, the agent perceives the value of an internal counter that tracks the number of small steps it has taken without turning. It has two possible actions: continue walking in the same direction, or turn. In either case, the agent takes a small step of length $d=1$ after making its decision. Let's define the parameters of the PS forager agent and initialize it:

In [None]:
NUM_ACTIONS = 2 # continue in the same direction, turn
SIZE_STATE_SPACE = np.array([TIME_EP]) # one state per value that the counter may possibly have within an episode.
#--the last two entries are just placeholders here, but the code is general enough to implement ensembles of interacting agents that forage together.--
GAMMA = 0.00001 #forgetting parameter in PS
ETA_GLOW = 0.1 #glow damping parameter in PS

#set a non-uniform initial policy, one column per counter value
INITIAL_DISTR = np.ones((2, TIME_EP))
INITIAL_DISTR[0, :] = 0.99 #probability of continuing in the same direction
INITIAL_DISTR[1, :] = 0.01 #probability of turning

#Initialize agent
agent = Forager(num_actions=NUM_ACTIONS,
                size_state_space=SIZE_STATE_SPACE,
                gamma_damping=GAMMA,
                eta_glow_damping=ETA_GLOW,
                initial_prob_distr=INITIAL_DISTR)
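To see what this initial policy implies, the sketch below computes the distribution of straight-segment lengths it produces. It assumes that a segment of length $L$ ends at the first turn decision, so that $P(L) = \pi(\text{turn}\,|\,L)\prod_{n<L}\pi(\text{continue}\,|\,n)$; the variable names are illustrative and not taken from the library.

```python
import numpy as np

TIME_EP = 200
turn_prob = np.full(TIME_EP, 0.01)  # probability of turning at each counter value
continue_prob = 1.0 - turn_prob     # probability of continuing straight

# Probability of having continued at every earlier counter value.
# The first entry is 1 because no earlier decision exists.
survival = np.concatenate(([1.0], np.cumprod(continue_prob)[:-1]))

# step_length_pmf[k] is the probability that the first turn happens
# at the (k+1)-th decision of a segment.
step_length_pmf = turn_prob * survival
```

With a constant turning probability of 0.01, the segment lengths are geometrically distributed, so an unbounded walk would have a mean segment length of $1/0.01 = 100$ steps. Within an episode of `TIME_EP` steps the distribution is truncated, and `step_length_pmf` sums to less than 1.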

We run the learning process.

In [None]:
for e in tqdm(range(EPISODES)):
        
    #restart environment and agent's counter and g matrix
    env.init_env()
    agent.agent_state = 0
    agent.reset_g()

    for t in range(TIME_EP):
        
        #step to set counter to its min. value n=1
        if t == 0 or env.kicked[0]:
            #do one step with random direction (no learning in this step)
            env.update_pos(1)
            #check boundary conditions
            env.check_bc()
            #reset counter
            agent.agent_state = 0
            #set kicked value to false again
            env.kicked[0] = 0
            
        else:
            #get perception
            state = agent.get_state()
            #decide
            action = agent.deliberate(state)
            #act (update counter)
            agent.act(action)
            
            #update positions
            env.update_pos(action)
            #check if a target was found and, if so, kick the agent
            reward = env.check_encounter()
                
            #check boundary conditions
            env.check_bc()
            #learn
            agent.learn(reward)
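For intuition about what `learn` does with the parameters `gamma_damping` and `eta_glow_damping`, here is a minimal sketch of a PS update in its commonly described form: h-values decay towards 1 at rate $\gamma$, the glow matrix $g$ is damped by $(1-\eta)$ at each step, and rewards reinforce recently used state-action pairs in proportion to their glow. The functions `ps_policy` and `ps_step`, the uniform initial h-values, and the exact order of operations are assumptions for illustration and may differ from the `Forager` implementation.

```python
import numpy as np

num_actions, num_states = 2, 200
gamma, eta = 0.00001, 0.1

h = np.ones((num_actions, num_states))   # h-values (assumed uniform here)
g = np.zeros((num_actions, num_states))  # glow matrix

def ps_policy(h, state):
    """Action probabilities proportional to the h-values of the state."""
    return h[:, state] / h[:, state].sum()

def ps_step(h, g, state, action, reward, gamma, eta):
    """One illustrative PS update (hypothetical helper)."""
    g = (1.0 - eta) * g       # damp the glow of earlier decisions
    g[action, state] = 1.0    # mark the state-action pair just used
    # Forget towards h = 1, then reinforce glowing pairs with the reward.
    h = h - gamma * (h - 1.0) + g * reward
    return h, g

h, g = ps_step(h, g, state=3, action=1, reward=1.0, gamma=gamma, eta=eta)
```

In this sketch, a rewarded decision raises the probability of repeating the same action in the same state, while the glow lets a reward also credit the decisions that preceded it.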


> For more details, see the `rl_opts` repository ([link](https://github.com/gorkamunoz/rl_opts)), which was developed for this project and from which all functions were inherited. That library will soon be deprecated, and everything will then run from `projective_simulation`.