# Chutes and Ladders using SARSA

This notebook uses SARSA to solve a modified version of the Chutes and Ladders game.

The game board is shown below. Players start on Square 0 (off the board) and move towards the goal square (State 100). Landing at the bottom of a ladder moves the player to the top of that ladder, while landing at the top of a chute moves the player to the bottom of that chute.


<img src="https://drive.google.com/uc?export=view&id=16k2EflsluUXCPVWVgL-vyv5-IzOqWZvW" alt="Drawing" width="500"/>


At each turn, the player may choose one of four dice (Efron dice):
- Blue: 3,3,3,3,3,3
- Black: 4,4,4,4,0,0
- Red: 6,6,2,2,2,2
- Green: 5,5,5,1,1,1

The **purpose** of this code is to determine which die to select at each turn so as to minimize the number of rolls it takes to reach the goal state (finish the game).
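
To get a feel for the trade-off between the four dice, the short sketch below computes the exact mean of each die from the face values listed above. The names `DICE_FACES` and `mean_roll` are illustrative and do not appear elsewhere in the notebook.

```python
from fractions import Fraction

# Face values copied from the list of dice above.
DICE_FACES = {
    "blue": [3, 3, 3, 3, 3, 3],
    "black": [4, 4, 4, 4, 0, 0],
    "red": [6, 6, 2, 2, 2, 2],
    "green": [5, 5, 5, 1, 1, 1],
}


def mean_roll(faces):
    """Exact expected value of one roll of a fair die with these faces."""
    return Fraction(sum(faces), len(faces))


for color, faces in DICE_FACES.items():
    print(f"{color:>5}: mean = {float(mean_roll(faces)):.3f}, possible rolls = {sorted(set(faces))}")
```

The means all lie between 8/3 and 10/3, so the average distance moved per turn is similar for every die. What differs is which squares each die can reach from a given position, and that is what the learned policy has to exploit around ladders and chutes.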

In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt

## State Transition Code

In [None]:
def nextState (state,roll):
    '''
    This function transitions from the current state and current dice roll to the next state.
    INPUTS:
        state is the current state you are in (0 to 100)
        roll is the number showing on the die (0 to 6; the black die can show 0)
    RETURN VALUE:
        a tuple (next_state, cost): the next state as an integer, and a cost of 1 for the turn taken
    '''
    # we create a dictionary for the ladders and chutes.  The key is the start state of the chute/ladder
    # and the value is the ending state.
    ladders = {1:38,4:14,9:31,21:42,28:84,36:44,51:67,71:91,80:100}
    chutes = {16:6,48:26,49:11,56:53,62:19,64:60,87:24,93:73,95:75,98:78}

    next_state = state + int(roll)
    if next_state > 100:
        next_state = 100
    # now check for ladders
    if next_state in ladders:
        next_state = ladders[next_state]
    # now check for chutes
    if next_state in chutes:
        next_state = chutes[next_state]

    return next_state, 1

def roll (dice_color):
    '''
    This function randomly rolls one of the four Efron dice.
    INPUT:
    dice_color should be one of "red", "blue", "black", or "green"
    OUTPUT:
    an integer randomly selected from the faces of the chosen die (None for an invalid color)
    '''

    if dice_color == 'red':
        return random.choice([2,2,2,2,6,6])
    if dice_color == 'blue':
        return 3
    if dice_color == 'black':
        return random.choice([0,0,4,4,4,4])
    if dice_color == 'green':
        return random.choice([1,1,1,5,5,5])
    # for invalid input
    return None
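
A few worked transitions make the rules above concrete. The sketch below re-implements the same logic in a compact, self-contained form; `next_square` is an illustrative stand-in for `nextState` (it returns only the square, not the cost), and the ladder and chute maps are copied from the code above.

```python
# Ladder and chute maps copied from nextState above.
LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 48: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}


def next_square(state, roll):
    """Illustrative stand-in for nextState: returns only the resulting square."""
    square = min(state + roll, 100)       # overshooting the goal still finishes the game
    square = LADDERS.get(square, square)  # climb a ladder if we landed on its bottom
    square = CHUTES.get(square, square)   # slide down a chute if we landed on its top
    return square


for state, rolled in [(0, 1), (12, 4), (20, 0), (97, 6), (77, 3)]:
    print(f"state {state:>2}, roll {rolled} -> state {next_square(state, rolled)}")
```

Rolling a 1 from the start lands on the ladder at square 1 and jumps to 38, rolling a 4 from square 12 hits the chute at 16 and drops to 6, a roll of 0 (black die) leaves the player in place, and any roll that passes 100 ends the game.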

# SARSA Approach

The SARSA algorithm uses an e-greedy method to choose the next action. With probability 1 - epsilon, the chosen action is the one the current policy considers best (here, the die with the lowest expected number of remaining rolls); with probability epsilon, the action is chosen at random. In doing so, the algorithm balances exploration against exploitation.
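
As a quick check of what this selection rule implies, the sketch below samples many e-greedy choices and compares the observed frequency of the greedy action with the value 1 - epsilon + epsilon/4. It assumes the random branch draws uniformly from all four dice; `epsilon_greedy` is a hypothetical standalone helper written for this illustration, not the notebook's own function.

```python
import numpy as np


def epsilon_greedy(greedy_action, epsilon, n_actions, rng):
    """Hypothetical helper: keep the greedy action unless an exploratory draw occurs."""
    if rng.random() > epsilon:
        return greedy_action
    # Exploratory branch: uniform over all actions (upper bound is exclusive).
    return int(rng.integers(0, n_actions))


rng = np.random.default_rng(0)
epsilon, n_actions, greedy_action = 0.1, 4, 2

samples = [epsilon_greedy(greedy_action, epsilon, n_actions, rng) for _ in range(100_000)]
freq = np.bincount(samples, minlength=n_actions) / len(samples)

# The random branch can also pick the greedy action, hence the extra epsilon / n_actions.
expected_greedy = 1 - epsilon + epsilon / n_actions
print("observed frequencies:", freq)
print("expected greedy frequency:", expected_greedy)
```

With epsilon = 0.1 and four dice, the greedy die is rolled about 92.5% of the time and each other die about 2.5% of the time.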

In [None]:
def e_greedy(s,policy,epsilon):
  '''
  Implements an e-greedy policy.
  With probability epsilon, it returns random action choice
  otherwise returns action choice specified by the policy

  s = current state
  policy = policy function (an array that is indexed by state)
  epsilon (0 to 1) a probability of picking exploratory random action
  '''
  r = np.random.random()
  if r > epsilon:
    return policy[s]
  else:
    # np.random.randint excludes the upper bound, so (0,4) covers all four dice (actions 0 to 3)
    return np.random.randint(0,4)

#-------------------------------------------------------------
def init ():
  '''
  Create the Q table (101 states x 4 dice, all zeros) and the
  default policy (action 1, the blue die, in every state)
  '''
  Q = np.zeros((101,4))
  print(len(Q))
  P = np.ones(101).astype(int)
  print(len(P))
  return Q,P


Additionally, SARSA is considered an on-policy algorithm: it updates the Q value of the current state-action pair using the Q value of the next state and the next action actually chosen by the same e-greedy policy that is being followed.
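
The sketch below illustrates what "on-policy" means for a single update, using a made-up two-state Q table. `sarsa_target` follows the description above; `q_learning_target` is shown only for contrast and is not used in this notebook. Because the Q values here are costs (rolls remaining), the off-policy variant takes a minimum rather than a maximum.

```python
import numpy as np


def sarsa_target(Q, cost, next_state, next_action):
    """On-policy target: uses the action that will actually be taken next."""
    return cost + Q[next_state, next_action]


def q_learning_target(Q, cost, next_state):
    """Off-policy target (for contrast only): uses the best next action."""
    return cost + Q[next_state].min()


# Made-up Q table: 2 states x 4 dice, values are estimated rolls remaining.
Q = np.array([
    [5.0, 5.0, 5.0, 5.0],
    [3.0, 2.0, 4.0, 6.0],
])
alpha = 0.1
state, action = 0, 0
next_state, next_action = 1, 3  # suppose the e-greedy draw explored with die 3

sarsa = sarsa_target(Q, 1, next_state, next_action)
q_learning = q_learning_target(Q, 1, next_state)
updated = Q[state, action] + alpha * (sarsa - Q[state, action])

print("SARSA target:", sarsa)
print("Q-learning target:", q_learning)
print("updated Q[0, 0] under SARSA:", updated)
```

When the next action is exploratory, the SARSA target reflects the cost of that exploration, whereas the Q-learning target ignores it. The SARSA estimates therefore describe the e-greedy behavior itself rather than the purely greedy policy.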

In [None]:
def SARSA (Q,policy,alpha,epsilon):
  '''
  Perform 1 unit of experience (1 trial, trajectory)
  using the SARSA learning algorithm
  '''
  k=0 #amount of turns
  state = 0
  action = e_greedy(state,policy,epsilon)
  total_reward = 0
  reward = 0

  #print("\n==== TRIAL ====")
  #print("state,action,reward: ({0:d},{1:d},{2:d})".format(state,action,reward))

  while (state < 100): #while we haven't reached the end of the board
    rollColor = ["red", "blue","black","green"]
    rollX = roll(rollColor[action]) #gets a roll by rolling a specific die specified by the action

    next_state,reward = nextState(state,rollX) #gets the next state and receives a reward of 1
    total_reward += reward

    next_action = e_greedy(next_state,policy,epsilon)
    TDerror = reward + Q[next_state,next_action] - Q[state,action] #cost of this roll (1) plus the estimated cost-to-go from (next_state,next_action), minus the current estimate for (state,action)
    Q[state,action] = Q[state,action] + alpha * TDerror #move the Q-value a fraction alpha towards the target, using the temporal-difference (TD) error
    state = next_state
    action = next_action
    #print("state,action,reward: ({0:d},{1:d},{2:d})".format(state,action,reward))
    k += 1

  # state 100 is terminal: no more rolls are needed, so Q[100,:] stays at 0
  # and no extra cost is added to the total for this trial
  return total_reward

In [None]:
def policy_improvement(Q):
  '''
  Update value function V and policy P based on Q values
  '''
  V = np.min(Q,axis=1) #V[s] is the smallest Q-value in state s, i.e. the estimated number of rolls remaining when the best die is chosen
  P = np.argmin(Q,axis=1) #the policy picks, in each state, the die with the lowest estimated number of rolls remaining
  return V,P


#-------------------------------------------------------------
def do_trials (Q,policy,n,alpha,epsilon):
  '''
  Perform n trials of learning
  '''
  R = 0   # total reward
  for i in range(n):
    R += SARSA(Q,policy,alpha,epsilon)

  return R / n

#-------------------------------------------------------------
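
The policy improvement step can be seen on a small, made-up Q table. The sketch below applies the same `min` / `argmin` logic to three rows and maps the resulting action indices to die colors, using the same index order as `rollColor`. The Q values are rounded illustrations, not results.

```python
import numpy as np

DICE = ["red", "blue", "black", "green"]  # same index order as rollColor in SARSA

# Made-up Q table: each row is a state, each column a die.
Q_demo = np.array([
    [12.5, 12.8, 12.9, 12.4],  # a visited state
    [0.0, 0.0, 0.0, 0.0],      # a state that was never visited
    [12.6, 11.3, 12.7, 11.7],  # another visited state
])

V_demo = Q_demo.min(axis=1)     # estimated rolls remaining under the best die
P_demo = Q_demo.argmin(axis=1)  # index of the best die in each state

for s, (v, a) in enumerate(zip(V_demo, P_demo)):
    print(f"state {s}: V = {v:.1f}, best die = {DICE[a]}")
```

Note that an all-zero row yields V = 0 and action 0 ("red"), because `argmin` returns the first index on ties. Squares at the bottom of a ladder or the top of a chute are never occupied at the start of a turn, so their rows stay at zero and the die reported for them carries no meaning.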




The code below runs 1000 sets of 10 SARSA trials each, and after each set it uses the "policy_improvement" function to update the policy.

In [None]:
Q,P = init()
m = 1000 #number of sets (the policy is improved after each set)
n = 10 #number of trials per set
epsilon = 0.1
alpha = 0.1


for i in range(m):
  R = do_trials(Q,P,n,alpha,epsilon)
  V,P = policy_improvement(Q)
print("Q = \n", Q)
print("V = \n",V)
print("P = \n",P)



101
101
Q = 
 [[13.55486435 13.37922997 12.56970972 12.43456251]
 [ 0.          0.          0.          0.        ]
 [12.57396206 12.81092334 12.956534   12.83051733]
 [12.54692641 12.47401905 12.69709399 12.58394776]
 [ 0.          0.          0.          0.        ]
 [13.00901119 12.92047939 12.32109241 12.8332215 ]
 [12.58836957 11.2905401  12.65824039 11.74624341]
 [11.70384559 12.48224468 12.39295786 12.11455768]
 [12.36581806 12.61676076 12.32845847 11.90921536]
 [ 0.          0.          0.          0.        ]
 [12.269304   12.24331622 12.29788218 12.35083737]
 [12.49446279 12.41803868 12.6168803  12.48510565]
 [11.77244371 11.71783156 11.96767238 11.8060996 ]
 [11.67826197 12.19770289 12.02435538 11.87118663]
 [11.69308913 11.41954731 11.49813991 11.52560701]
 [11.0934377  11.24446018 11.11027633 11.21570179]
 [ 0.          0.          0.          0.        ]
 [10.41860199 10.59661812 10.72409712 10.5813799 ]
 [10.33788256 10.27464337 10.4584089  10.31274539]
 [ 9.87705213  9.

# Conclusion:

From the V values above, which estimate the expected number of rolls remaining from each state, we can see that on average it will take us 11.95 rolls to reach the final state (win Chutes and Ladders) if we follow the policy derived by SARSA.
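
One way to sanity-check an estimate like this is to play out a fixed policy many times and average the number of rolls. The sketch below does that in a self-contained form. The helper names (`play_game`, `average_rolls`), the `max_rolls` cap, and the always-green example policy are assumptions made for illustration; to evaluate the learned policy, pass the array `P` instead.

```python
import random

# Board and dice copied from the code above; action indices follow rollColor.
LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 48: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}
DICE_FACES = [
    [2, 2, 2, 2, 6, 6],  # 0: red
    [3, 3, 3, 3, 3, 3],  # 1: blue
    [0, 0, 4, 4, 4, 4],  # 2: black
    [1, 1, 1, 5, 5, 5],  # 3: green
]


def play_game(policy, rng, max_rolls=10_000):
    """Play one game greedily under `policy`; return the number of rolls used."""
    state, rolls = 0, 0
    # The cap guards against policies that never finish, e.g. rolling blue on
    # square 53 always lands on 56 and slides straight back to 53.
    while state < 100 and rolls < max_rolls:
        step = rng.choice(DICE_FACES[policy[state]])
        square = min(state + step, 100)
        square = LADDERS.get(square, square)
        state = CHUTES.get(square, square)
        rolls += 1
    return rolls


def average_rolls(policy, n_games=2000, seed=0):
    """Monte Carlo estimate of the mean number of rolls per game."""
    rng = random.Random(seed)
    return sum(play_game(policy, rng) for _ in range(n_games)) / n_games


always_green = [3] * 101  # hypothetical baseline policy: roll green everywhere
print("average rolls, always green:", average_rolls(always_green))
```

Because this simulation follows the policy without exploration, its average can differ somewhat from the SARSA value estimates, which were learned under e-greedy behavior.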

In [None]:
line = "["
for i in range(len(V)):
    line += " " + str(V[i]) + ","


print(line[:-1], "]")

[ 12.434562511905654, 0.0, 12.573962064424991, 12.474019050281758, 0.0, 12.321092408557538, 11.290540097484627, 11.70384559248309, 11.909215356893002, 0.0, 12.243316217853248, 12.418038680702253, 11.717831555109784, 11.678261966397066, 11.419547306528596, 11.093437701931427, 0.0, 10.418601991066838, 10.274643372871363, 9.774877192216472, 9.702179360037837, 0.0, 8.608410879379903, 8.1557821801937, 8.20469646979396, 7.578994885316556, 8.736050894107445, 8.729272416709112, 0.0, 10.034432134451375, 9.897635576785756, 10.218842661706965, 10.011851512619465, 9.498948222153201, 9.721772363147167, 10.103310682990871, 0.0, 10.630711068955575, 10.53431462075852, 10.122495307434413, 9.973920665767249, 9.57288035809776, 9.351859696929955, 8.584285100055357, 8.445641983825352, 7.4482973225199816, 7.524185036340084, 7.992167818590519, 0.0, 0.0, 8.222896513450046, 0.0, 8.565129572013253, 9.651534955410224, 9.02348292722168, 9.159029623153582, 0.0, 8.576413954573123, 8.379901888424595, 8.1683994582844

In [None]:
strings = []
line = "["

dice = ["red", "blue", "black", "green"]
for i in range(len(P)):
    line += " " + dice[int(P[i])] + ","
    strings.append(dice[int(P[i])])

In [None]:
print(line[:-1], "]")

[ green, red, red, blue, red, black, blue, red, green, red, blue, blue, blue, red, blue, red, red, red, blue, blue, red, red, blue, green, black, blue, red, green, red, red, blue, green, black, blue, red, green, red, blue, blue, red, blue, red, blue, black, blue, red, green, black, red, red, green, red, blue, red, black, red, red, red, red, black, blue, red, red, red, red, blue, blue, black, blue, red, green, red, blue, blue, blue, green, black, blue, red, green, red, blue, red, red, green, blue, blue, red, red, red, green, blue, black, red, blue, red, green, blue, red, blue, red ]


In [None]:
# print the policy for states 0 to 99 in rows of ten states each
for i in range(10):
  line = "["
  for j in range(10):
    line += " " + strings[j+(i*10)] + ","
  line += "]"
  print(line)

[ green, red, red, blue, red, black, blue, red, green, red,]
[ blue, blue, blue, red, blue, red, red, red, blue, blue,]
[ red, red, blue, green, black, blue, red, green, red, red,]
[ blue, green, black, blue, red, green, red, blue, blue, red,]
[ blue, red, blue, black, blue, red, green, black, red, red,]
[ green, red, blue, red, black, red, red, red, red, black,]
[ blue, red, red, red, red, blue, blue, black,