# Minimal Reinforcement Learning in R

This notebook builds a minimal reinforcement learning (RL) example using **Q-learning** in R.

Goal: learn a policy that moves an agent to a target state as efficiently as possible.

## Environment Setup

We use a simple 1D world with 5 states: `1, 2, 3, 4, 5`.

- Start state: `1`
- Terminal (goal) state: `5`
- Actions: `left`, `right`
- Transition: deterministic (left decreases state by 1; right increases by 1, bounded in `[1, 5]`)
- Reward: `+1` for the move that reaches state `5`; otherwise `-0.02` for `left` and `-0.04` for `right`

This setting is tiny but enough to illustrate the RL ingredients: states, actions, rewards, and policy learning.
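To make the rules above concrete, the following sketch enumerates every state-action pair and tabulates where it leads and what it pays. The data frame name `transitions` is introduced here for illustration only; the notebook itself encodes the same rules in a function.

```r
# Illustrative only: tabulate the dynamics described in the list above.
n_states <- 5
goal_state <- 5
actions <- c("left", "right")

# One row per (state, action) pair; `transitions` is a hypothetical name
transitions <- expand.grid(state = 1:n_states, action = actions,
                           stringsAsFactors = FALSE)

# left moves down by 1, right moves up by 1, both clipped to [1, n_states]
transitions$next_state <- ifelse(transitions$action == "left",
                                 pmax(1, transitions$state - 1),
                                 pmin(n_states, transitions$state + 1))

# +1 for landing on the goal, otherwise a small action-dependent cost
transitions$reward <- ifelse(transitions$next_state == goal_state, 1,
                             ifelse(transitions$action == "left", -0.02, -0.04))

# Non-terminal states only: episodes end at the goal, so no action is taken there
print(transitions[transitions$state != goal_state, ], row.names = FALSE)
```

Only one of the eight non-terminal rows (state `4`, action `right`) pays `+1`; every other move carries a small cost.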

In [308]:
# Environment definition
n_states <- 5
goal_state <- 5
actions <- c("left", "right")

# Transition and reward function
step_env <- function(state, action) {
  if (action == "left") {
    next_state <- max(1, state - 1)
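  } else if (action == "right") {
    next_state <- min(n_states, state + 1)
  } else {
    stop("Unknown action: ", action)
  }

  # The episode ends when the agent lands on the goal state
  done <- (next_state == goal_state)
  # +1 on reaching the goal; otherwise a small cost per move
  reward <- if (done) 1 else if (action == "left") -0.02 else -0.04

  list(next_state = next_state, reward = reward, done = done)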
}

## Policy and Learning Rule

We store action values in a Q-table `Q[state, action]`.

- **Behavior policy (for training):** epsilon-greedy
  - with probability `epsilon`, choose a random action (explore)
  - otherwise choose the action with highest current Q-value (exploit)
- **Update rule (Q-learning):**

$$Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a')\right]$$

where `alpha` is the learning rate and `gamma` is the discount factor.
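As a worked illustration of the update rule, the sketch below applies it twice by hand, starting from an all-zero table. The helper `q_update` is hypothetical and not part of the notebook, and `alpha = 0.8` and `gamma = 0.9` are values assumed for this illustration. It also checks that the rule agrees with the equivalent temporal-difference form, "old estimate plus `alpha` times the error".

```r
alpha <- 0.8   # learning rate (assumed value for this illustration)
gamma <- 0.9   # discount factor (assumed value for this illustration)

# Hypothetical helper: one Q-learning update for a single (s, a) entry.
# q_sa:       current estimate Q(s, a)
# r:          reward received
# max_q_next: max over a' of Q(s', a')
q_update <- function(q_sa, r, max_q_next) {
  target <- r + gamma * max_q_next
  (1 - alpha) * q_sa + alpha * target
}

# Step into the goal from state 4: reward +1, terminal Q-values are 0
q4_right <- q_update(q_sa = 0, r = 1, max_q_next = 0)

# Step from state 3 to state 4: reward -0.04, bootstraps from q4_right
q3_right <- q_update(q_sa = 0, r = -0.04, max_q_next = q4_right)

# Same update written as "old estimate + alpha * TD error"
td_form <- 0 + alpha * ((-0.04 + gamma * q4_right) - 0)

print(c(q4_right = q4_right, q3_right = q3_right, td_form = td_form))
```

Under these assumptions the first update gives `0.8` and the second gives `0.8 * (-0.04 + 0.9 * 0.8) = 0.544`, which shows how the goal reward propagates backward one state at a time.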

In [None]:
# Hyperparameters

# alpha <- 1.0 # macro uses alpha = 1.0 (old estimate fully replaced by the target)
alpha <- 0.8 # statistical practice: partial update toward the target
gamma <- 0.9
epsilon <- 0.2
n_episodes <- 20
max_steps <- 10


# Q-table: a 5-by-2 matrix (rows are states, columns are actions)
Q <- matrix(0, nrow = n_states, ncol = length(actions),
            dimnames = list(state = 1:n_states, action = actions))

# epsilon-greedy policy: maps (state, Q) to an action
choose_action <- function(state, Q, epsilon) {
  if (runif(1) < epsilon) { # exploration
    sample(actions, 1) # pick "left" or "right" uniformly at random
  } else { # exploitation
    actions[ which.max( Q[state, ] ) ]
    # given the state, take the action with the largest Q-value
    # (which.max returns the first maximum, so ties go to "left")
  }
}
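
One detail of `which.max` is that it returns the first maximum, so while both Q-values in a state are tied (as they are in the all-zero initial table) the exploit branch always returns `left`. If that bias is unwanted, ties can be broken at random. The sketch below is one possible variant; `choose_action_random_ties` is a hypothetical name, not something the notebook defines.

```r
actions <- c("left", "right")

# Hypothetical variant of epsilon-greedy with random tie-breaking
choose_action_random_ties <- function(q_row, epsilon) {
  if (runif(1) < epsilon) {
    return(sample(actions, 1))           # explore: uniform over actions
  }
  best <- which(q_row == max(q_row))     # indices of all maximal entries
  if (length(best) > 1) {
    best <- sample(best, 1)              # pick one of the tied actions
  }
  actions[best]
}

set.seed(1)  # arbitrary seed, for reproducibility only
tied_row <- c(left = 0, right = 0)

# With epsilon = 0 and tied values, both actions should appear about equally
picks <- replicate(10000, choose_action_random_ties(tied_row, epsilon = 0))
print(table(picks) / length(picks))

# With a clear winner and epsilon = 0.2, the greedy action is chosen with
# probability (1 - epsilon) + epsilon / 2 = 0.9
clear_row <- c(left = -0.05, right = 0.6)
picks2 <- replicate(10000, choose_action_random_ties(clear_row, epsilon = 0.2))
print(mean(picks2 == "right"))
```

The second check is a reminder that, with two actions, an exploratory draw still lands on the greedy action half of the time.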


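The matrix `Q` evolves as training proceeds; the loop below prints it after every episode. The printed tables are consistent with a run at `alpha = 1.0` rather than `alpha = 0.8`: for instance, `Q[4, "right"]` equals exactly `1` as soon as the goal is first reached, whereas `alpha = 0.8` would give `0.8`.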

In [310]:
# Training loop
episode_returns <- numeric(n_episodes)
QQ <- matrix(0, n_states, n_episodes)

for (ep in 1:n_episodes) {
  state <- 1 # every episode starts in state 1

  for (t in 1:max_steps) { # at most max_steps steps per episode

    action <- choose_action(state, Q, epsilon) # exploration vs. exploitation
    out <- step_env(state, action)
    # outcome of taking this action in this state

    s_next <- out$next_state
    r <- out$reward
    done <- out$done # unpack the outcome

    # if (ep == n_episodes )
    # cat("step", t, "with reward", r, "with state", s_next, "\n") # report

    a_idx <- match(action, actions) # column index of the chosen action
    target <- r + gamma * max(Q[s_next, ])
    # immediate reward + discounted best Q-value in the next state

    Q[state, a_idx] <- (1 - alpha) * Q[state, a_idx] + alpha * target
    # only the entry for the visited (state, action) pair changes
    # Q is updated at every step

    state <- s_next

    if (done) break
  }
  print(Q) # show the Q-table after each episode

}



     action
state    left right
    1 -0.0542 -0.04
    2 -0.0560 -0.04
    3 -0.0542  0.00
    4  0.0000  0.00
    5  0.0000  0.00
     action
state    left  right
    1 -0.0542 -0.076
    2 -0.0560 -0.040
    3 -0.0542 -0.040
    4 -0.0560  1.000
    5  0.0000  0.000
     action
state      left  right
    1 -0.081902 -0.076
    2 -0.056000 -0.076
    3 -0.054200  0.860
    4 -0.056000  1.000
    5  0.000000  0.000
     action
state       left   right
    1 -0.0937118 -0.1084
    2 -0.0937118  0.7340
    3 -0.0542000  0.8600
    4 -0.0560000  1.0000
    5  0.0000000  0.0000
     action
state       left  right
    1 -0.1139066 0.6206
    2 -0.0937118 0.7340
    3 -0.0542000 0.8600
    4 -0.0560000 1.0000
    5  0.0000000 0.0000
     action
state       left  right
    1 -0.1139066 0.6206
    2 -0.0937118 0.7340
    3  0.6406000 0.8600
    4 -0.0560000 1.0000
    5  0.0000000 0.0000
     action
state       left  right
    1 -0.1139066 0.6206
    2 -0.0937118 0.7340
    3  0.6406000 0.860
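
The training cell allocates `episode_returns` but does not fill it. A minimal, self-contained sketch of how per-episode returns could be tracked is given below. It restates the environment and the update compactly; the function name `run_q_learning`, the seed, and `n_episodes = 200` are assumptions made for this illustration, not values taken from the notebook run.

```r
# Self-contained sketch: Q-learning on the 5-state chain, recording the
# undiscounted return (sum of rewards) of each training episode.
run_q_learning <- function(n_episodes = 200, max_steps = 10,
                           alpha = 0.8, gamma = 0.9, epsilon = 0.2) {
  n_states <- 5
  actions <- c("left", "right")
  Q <- matrix(0, nrow = n_states, ncol = length(actions),
              dimnames = list(state = 1:n_states, action = actions))
  episode_returns <- numeric(n_episodes)

  for (ep in seq_len(n_episodes)) {
    state <- 1
    for (t in seq_len(max_steps)) {
      # epsilon-greedy action selection (column index into Q)
      if (runif(1) < epsilon) {
        a_idx <- sample(length(actions), 1)
      } else {
        a_idx <- which.max(Q[state, ])
      }
      go_left <- actions[a_idx] == "left"

      # deterministic move, clipped to [1, n_states]
      s_next <- if (go_left) max(1, state - 1) else min(n_states, state + 1)
      done <- (s_next == n_states)
      r <- if (done) 1 else if (go_left) -0.02 else -0.04

      # Q-learning update for the visited (state, action) pair
      target <- r + gamma * max(Q[s_next, ])
      Q[state, a_idx] <- (1 - alpha) * Q[state, a_idx] + alpha * target

      episode_returns[ep] <- episode_returns[ep] + r
      state <- s_next
      if (done) break
    }
  }
  list(Q = Q, episode_returns = episode_returns)
}

set.seed(42)  # arbitrary seed, for reproducibility only
fit <- run_q_learning()

# Average return over the first and last 20 episodes
print(c(first_20 = mean(head(fit$episode_returns, 20)),
        last_20  = mean(tail(fit$episode_returns, 20))))
```

If learning works as intended, later episodes should show a higher average return than early ones. The best possible return is `0.88` (three moves at `-0.04` plus the `+1` goal reward), reached only when an episode contains no exploratory detours.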

## Extract the Learned Policy

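After training, the greedy policy picks the action with the largest Q-value in each state. State `5` is terminal and its row of `Q` is never updated, so both entries remain `0` and `which.max` returns the first action, `left`; that row of the policy carries no meaning.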

In [311]:
greedy_policy <- apply(Q, 1, function(q_row) actions[which.max(q_row)])

policy_table <- data.frame(
  state = 1:n_states,
  action = greedy_policy,
  row.names = NULL
)

policy_table

state,action
<int>,<chr>
1,right
2,right
3,right
4,right
5,left


## One Greedy Episode (Evaluation)

Now we run one episode using the learned greedy policy (`epsilon = 0`) to see the behavior.

In [312]:
state <- 1
trajectory <- data.frame(step = 0, state = state, action = NA, reward = 0)

for (t in 1:10) {
  action <- actions[which.max(Q[state, ])]
  out <- step_env(state, action)

  trajectory <- rbind(
    trajectory,
    data.frame(step = t, state = out$next_state, action = action, reward = out$reward)
  )

  state <- out$next_state
  if (out$done) break
}

trajectory

step,state,action,reward
<dbl>,<dbl>,<chr>,<dbl>
0,1,,0.0
1,2,right,-0.04
2,3,right,-0.04
3,4,right,-0.04
4,5,right,1.0


The learned policy chooses `right` in every state from `1` to `4`, so the agent reaches the goal state `5` in four steps, the minimum possible from state `1`.
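
As a sanity check on the evaluation run, the discounted return of the four-step greedy trajectory can be compared with the state values of the always-right policy, obtained by backward recursion: V(4) = 1 and V(s) = -0.04 + gamma * V(s + 1) for s < 4. The sketch below assumes `gamma = 0.9` and uses the rewards listed in the trajectory table; the variable names are illustrative.

```r
gamma <- 0.9  # discount factor (assumed, matching the hyperparameter cell)

# Rewards along the greedy trajectory 1 -> 2 -> 3 -> 4 -> 5
rewards <- c(-0.04, -0.04, -0.04, 1)

# Discounted return from the start state: sum over t of gamma^(t-1) * r_t
discounted_return <- sum(gamma^(seq_along(rewards) - 1) * rewards)

# Values of the always-right policy, by backward recursion
n_states <- 5
V <- numeric(n_states)          # V[5] stays 0: the goal is terminal
V[4] <- 1                       # entering the goal pays +1
for (s in 3:1) {
  V[s] <- -0.04 + gamma * V[s + 1]
}

print(round(V, 4))
print(discounted_return)
```

Both the recursion and the trajectory give `0.6206` for state `1`, the same value that `Q[1, "right"]` settles at in the printed Q-tables.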