# Chutes and Ladders using Monte Carlo Learning
This notebook uses Monte Carlo learning to find a good policy for a modified game of Chutes and Ladders.

The game board is shown below. Players start on Square 0 (outside the board) and move towards the goal space (State 100). Landing at the bottom of a ladder moves the player to the top, while landing at the top of a chute moves the player to the bottom square.


<img src="https://drive.google.com/uc?export=view&id=16k2EflsluUXCPVWVgL-vyv5-IzOqWZvW" alt="Drawing" width="500"/>


At each turn, the player may choose one of four dice (Efron's dice):
- Blue: 3,3,3,3,3,3
- Black: 4,4,4,4,0,0
- Red: 6,6,2,2,2,2
- Green: 5,5,5,1,1,1

The **purpose** of this code is to determine which die to select at each turn so as to minimize the number of moves it takes to reach the goal state.
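
As a quick illustration of why the choice is not obvious, the sketch below computes the average roll of each die from the faces listed above. The names `DICE` and `expected_roll` are made up for this example and are not used elsewhere in the notebook. The averages turn out to be close to one another, so which die is best on a given square depends mostly on where the ladders and chutes are.

```python
from fractions import Fraction

# Faces of the four dice, copied from the list above
DICE = {
    "blue": [3, 3, 3, 3, 3, 3],
    "black": [4, 4, 4, 4, 0, 0],
    "red": [6, 6, 2, 2, 2, 2],
    "green": [5, 5, 5, 1, 1, 1],
}

def expected_roll(faces):
    """Return the exact average of a die's faces as a fraction."""
    return Fraction(sum(faces), len(faces))

for color, faces in DICE.items():
    print(color, float(expected_roll(faces)))
```

The red die has the highest average (10/3) and the black die the lowest (8/3), with blue and green both at 3.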

In [None]:
# Import Necessary Libraries
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from statistics import mean
from sklearn import svm
from matplotlib import cm
from scipy import linalg as la
import sys
import tensorflow as tf
import keras
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from datetime import datetime
from packaging import version
import random

## State Transition Code

In [None]:
def nextState(state,roll):
    '''
    This function transitions from the current state and current dice roll to the next state.
    INPUTS:
        state is the current state you are in (0 to 100)
        roll is the number showing on the die (0 to 6; the black die can roll a 0)
    RETURN VALUE:
        the next state (an integer from 0 to 100), after applying any ladder or chute
    '''
    # we create a dictionary for the ladders and chutes.  The key is the start state of the chute/ladder
    # and the value is the ending state.
    ladders = {1:38,4:14,9:31,21:42,28:84,36:44,51:67,71:91,80:100}
    chutes = {16:6,48:26,49:11,56:53,62:19,64:60,87:24,93:73,95:75,98:78}

    next_state = state + roll
    if next_state > 100:
        next_state = 100
    # now check for ladders
    if next_state in ladders:
        next_state = ladders[next_state]
    # now check for chutes
    if next_state in chutes:
        next_state = chutes[next_state]

    return next_state

def roll(dice_color):
    '''
    This function randomly rolls one of the four Efron dice.
    INPUT:
    dice_color is an integer code for the die to roll:
        0 = red
        1 = blue
        2 = black
        3 = green
    OUTPUT:
    an integer randomly selected from the faces of that die
    (None if dice_color is not a valid code)
    '''

    if dice_color == 0:
        return random.choice([2,2,2,2,6,6])
    if dice_color == 1:
        return 3
    if dice_color == 2:
        return random.choice([0,0,4,4,4,4])
    if dice_color == 3:
        return random.choice([1,1,1,5,5,5])
    # for invalid input
    return None
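
To make the transition rule concrete, here is a small self-contained sketch that lists where a given die can take you from a given square, with exact probabilities. It re-declares the ladder and chute dictionaries so it can run on its own. `next_square` and `transition_probs` are illustrative names, not part of the notebook's code.

```python
from collections import Counter
from fractions import Fraction

# Same board as in nextState above
LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 48: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}

def next_square(state, roll):
    """Move forward by roll, cap at 100, then apply any ladder or chute."""
    nxt = min(state + roll, 100)
    nxt = LADDERS.get(nxt, nxt)
    nxt = CHUTES.get(nxt, nxt)
    return nxt

def transition_probs(state, faces):
    """Map each reachable square to its probability for a die with the given faces."""
    hits = Counter(next_square(state, face) for face in faces)
    return {square: Fraction(n, len(faces)) for square, n in hits.items()}

green = [1, 1, 1, 5, 5, 5]
print(transition_probs(0, green))   # a roll of 1 lands on the ladder at square 1
print(transition_probs(92, green))  # a roll of 1 lands on the chute at square 93
```

From square 0 the green die reaches 38 or 5 with equal probability, and from square 92 it reaches 73 or 97 with equal probability.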



## Monte Carlo Learning Approach

Below, we implement a Monte Carlo learning model that plays our modified version of Chutes and Ladders. Monte Carlo learning is a relatively basic approach to reinforcement learning: we sample many trajectories (complete games) and, for each state-action pair, average the returns observed after it. Thousands of trajectories are needed to get a realistic estimate, so we play thousands of games on the Chutes and Ladders board.

Training our Monte Carlo model alternates between two steps:

1. Policy Evaluation: We play the game and update our Q matrix. Importantly, the policy is frozen during this step: we play many games under the current policy and record the outcomes, but do not change the policy itself.

2. Policy Improvement: We update our policy using the Q matrix produced by the policy evaluation step.


Within each code chunk is a short description of the algorithm.
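
Before the full implementation, the averaging idea can be sketched on a toy problem. The example below uses two made-up states, `"A"` and `"B"`, and two hand-written trajectories instead of the real board. It counts every occurrence of a state-action pair (every-visit averaging), which appears to be what the notebook's code does as well. The function names are hypothetical.

```python
from collections import defaultdict

def returns_to_go(rewards):
    """For each step, sum the rewards from that step to the end of the episode."""
    running, out = 0, []
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def average_returns(trajectories):
    """Every-visit Monte Carlo estimate of Q: the mean return per (state, action)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        for (s, a, _), g in zip(traj, returns_to_go(rewards)):
            totals[(s, a)] += g
            counts[(s, a)] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Two made-up episodes of (state, action, reward); each move costs 1
episodes = [
    [("A", 0, 1), ("B", 1, 1)],
    [("A", 0, 1), ("B", 1, 1), ("B", 1, 1)],
]
Q = average_returns(episodes)
print(Q)
```

Here `("A", 0)` was followed by 2 moves in one episode and 3 in the other, so its estimate is 2.5.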

In [None]:
#-------------------------------------------------------------
def update (s,a):
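  '''
  Take action a (a die choice, 0-3) from state s (current square, 0-100)
  Returns the next state and the reward for the move
  Every move costs 1, so a trajectory's total reward is its number of moves
  '''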

  chosenRoll = roll(a)
  nextS = nextState(s, chosenRoll)

  return nextS, 1

#-------------------------------------------------------------
def e_greedy(s,policy,epsilon):
  '''
  Implements an e-greedy policy.
  With probability epsilon, it returns random action choice
  otherwise returns action choice specified by the policy

  s = current state
  policy = policy function (an array that is indexed by state)
  epsilon (0 to 1) a probability of picking exploratory random action
  '''
  r = np.random.random()
  if r > epsilon:
    return policy[s]
  else:
    return np.random.randint(0,4)

#-------------------------------------------------------------
def make_trajectory(policy,epsilon):
  '''
  Simulate one trajectory of experience
  Return list of tuples during trajectory
  Each tuple is (s,a,r) -> state / action / reward
  epsilon = probability of exploratory action
  '''
  traj = []
  s=0

  while (s < 100):
    a = e_greedy(s,policy,epsilon)
    s_prev = s
    s,r = update(s,a)
    #print(s)
    traj.append((s_prev,a,r))

  # terminal entry: the goal state with a placeholder action and zero reward
  traj.append((s,0,0))
  return traj

#-------------------------------------------------------------
def init():
  '''
  Create totals, counts and policy defaults
  '''
  totals = np.zeros((101,4), dtype=int)
  counts = np.zeros((101,4),dtype=int)
  P = np.zeros(101, dtype=int)
  return totals,counts,P

#-------------------------------------------------------------
def policy_improvement(Q):
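  '''
  Update value function V and policy P based on Q values
  Q estimates the number of moves remaining, so lower is better:
  V takes the minimum over actions and P the action that achieves it
  '''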
  V = np.min(Q,axis=1)
  P = np.argmin(Q,axis=1)
  return V,P

#-------------------------------------------------------------
def policy_evaluation(totals,counts,policy,n,epsilon):
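  '''
  Run n trajectories under the given policy (with e-greedy exploration)
  and update the totals/counts arrays in place
  '''
  for i in range(n):
    t = make_trajectory(policy,epsilon)
    m = len(t)
    # sum_r[j] = total reward (moves remaining) from step j to the end of the game
    sum_r = np.zeros(m)
    sum_r[m-1] = t[-1][2]
    for j in range(m-2,-1,-1):
      sum_r[j] = sum_r[j+1] + t[j][2]

    for j in range(m):
      s,a,r = t[j]
      if s == 100:
        # the terminal entry carries a placeholder action; recording it
        # under state 99 would distort the estimates for that state
        continue
      counts[s,a] += 1
      totals[s,a] += sum_r[j]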

#-------------------------------------------------------------
def policy_iteration(totals,counts,policy,Q,n,m,epsilon):
  '''
  Perform n iterations of policy iteration
  using m trials (episodes) per policy update
  totals and counts are updated in place; the policy and Q arguments
  are recomputed from them on every iteration
  '''
  for i in range(n):
    Q = compute_Q(totals,counts)
    V,P = policy_improvement(Q)
    policy_evaluation(totals,counts,P,m,epsilon)

  Q = compute_Q(totals,counts)
  V,P = policy_improvement(Q)
  return Q,V,P


#-------------------------------------------------------------
def compute_Q(totals,counts):
  '''
  Compute the Q values based on totals and counts (average)
  '''
  Q = np.zeros((101,4))
  for i in range(len(totals)):
    for a in range(4):
      if counts[i][a] > 0:

        Q[i][a] = round((totals[i][a] / counts[i][a]), 2)

      else:
        Q[i][a] = 0
  return Q
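
The averaging in `compute_Q` can also be written without explicit loops. The sketch below is one possible vectorized version using `np.divide` with a `where` mask. `compute_Q_vectorized` is a hypothetical name, and the function is intended to give the same values as the loop above rather than to replace it.

```python
import numpy as np

def compute_Q_vectorized(totals, counts):
    """Average totals by counts, leaving unvisited (state, action) pairs at 0."""
    Q = np.zeros(totals.shape, dtype=float)
    # Only divide where a pair has been visited; other entries keep their 0
    np.divide(totals, counts, out=Q, where=counts > 0)
    return np.round(Q, 2)

# Small made-up example: 2 states, 2 actions
totals_demo = np.array([[10, 0], [7, 3]])
counts_demo = np.array([[4, 0], [2, 3]])
print(compute_Q_vectorized(totals_demo, counts_demo))
```

Note that unvisited pairs keep a Q value of 0. Because every move costs 1 and the policy takes the minimum, an untried action looks better than any tried one, which nudges the learner to try each action at least once.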

Below we print a few sample trajectories, along with the average number of moves per game across them.

Note: this algorithm involves randomness, so you may see different results when you run it.

In [None]:
totals,counts,P = init() # initialize values
epsilon = 0.1 # 10% chance of a random move being taken
sumt = 0
k = 10
for i in range(k):
  t = make_trajectory(P,epsilon)
  sumt += len(t) - 1 # number of moves; the last entry is the terminal state
  if k <= 10:
    print(t)
print("Average moves per game:",sumt/k)

[(0, 0, 1), (2, 0, 1), (14, 0, 1), (6, 0, 1), (12, 0, 1), (14, 0, 1), (6, 0, 1), (8, 0, 1), (10, 0, 1), (6, 0, 1), (8, 0, 1), (10, 0, 1), (12, 0, 1), (18, 0, 1), (20, 2, 1), (24, 0, 1), (26, 0, 1), (32, 0, 1), (34, 0, 1), (44, 0, 1), (46, 0, 1), (26, 0, 1), (84, 0, 1), (86, 0, 1), (92, 0, 1), (78, 0, 1), (84, 2, 1), (88, 0, 1), (90, 0, 1), (92, 0, 1), (78, 0, 1), (100, 0, 0)]
[(0, 0, 1), (6, 0, 1), (8, 0, 1), (10, 0, 1), (6, 0, 1), (8, 0, 1), (10, 1, 1), (13, 0, 1), (15, 0, 1), (42, 0, 1), (44, 0, 1), (50, 0, 1), (53, 0, 1), (55, 0, 1), (57, 0, 1), (63, 0, 1), (65, 0, 1), (67, 0, 1), (69, 0, 1), (75, 0, 1), (81, 1, 1), (84, 0, 1), (86, 0, 1), (88, 1, 1), (91, 0, 1), (97, 0, 1), (99, 0, 1), (100, 0, 0)]
[(0, 0, 1), (2, 0, 1), (8, 0, 1), (10, 0, 1), (12, 0, 1), (14, 0, 1), (20, 2, 1), (20, 0, 1), (22, 0, 1), (84, 0, 1), (86, 0, 1), (92, 0, 1), (78, 0, 1), (84, 0, 1), (86, 0, 1), (88, 0, 1), (90, 0, 1), (96, 0, 1), (100, 0, 0)]
[(0, 0, 1), (6, 0, 1), (8, 0, 1), (10, 0, 1), (12, 0, 1), (14

The code below trains our model as described at the start of this section. It then prints the results: the Q matrix, the value list V, and the policy P after the final iteration.

In [None]:
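# seed both random number generators if you would like to repeat the same process:
# e_greedy uses np.random, while roll uses Python's random module
np.random.seed(seed=42)
random.seed(42)

totals,counts,P = init() # initialize values

# run experiment on a larger basis
m = 10
n = 15000
epsilon = 0.1 # 10% chance of a random move being taken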
Q = compute_Q(totals,counts)
V,P = policy_improvement(Q)
policy_iteration(totals,counts,P,Q,n,m,epsilon)

Q = compute_Q(totals,counts)
V,P = policy_improvement(Q)
print('Q =\n',Q)
print('V =\n',V)
print('P =\n',P)



Q =
 [[1.6070e+01 1.6750e+01 1.5070e+01 1.2700e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00]
 [1.5780e+01 1.4820e+01 1.6560e+01 1.7000e+01]
 [3.1140e+01 1.6050e+01 1.7310e+01 1.4830e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00]
 [1.6430e+01 1.4780e+01 1.3050e+01 1.5740e+01]
 [1.5100e+01 1.5770e+01 1.7510e+01 1.7410e+01]
 [2.2960e+01 1.9030e+01 1.6790e+01 1.4960e+01]
 [1.7370e+01 1.6610e+01 1.7530e+01 1.3500e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00]
 [1.9240e+01 2.3090e+01 1.8420e+01 1.4830e+01]
 [1.9230e+01 8.1610e+01 2.0140e+01 1.5330e+01]
 [1.7410e+01 1.5190e+01 1.7090e+01 1.3780e+01]
 [1.3470e+01 1.8380e+01 1.5640e+01 1.5090e+01]
 [1.5120e+01 9.8510e+01 1.9860e+01 1.7200e+01]
 [1.2080e+01 1.5900e+01 1.4650e+01 1.4260e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00]
 [1.5400e+01 7.3780e+01 1.3690e+01 1.2110e+01]
 [1.1940e+01 1.8410e+01 1.2360e+01 1.4620e+01]
 [1.0930e+01 3.4520e+01 1.4400e+01 1.4790e+01]
 [1.1000e+01 9.3370e+01 1.3320e+01 1.4110e+01]
 [0.0000

The `assess` function will conduct a series of non-learning trials and evaluate the value of the policy. In this case, that value would be the average number of moves it takes to complete the game across all trials.

In [None]:
#-------------------------------------------------------------
def assess(policy,trials):
  '''
  Assess the value of the current policy by completing #trials
  using the specified policy (no e-greedy random actions)
  Does not accrue learning experience nor change policy
  Returns the average number of moves per game
  '''
  total_moves = 0
  for i in range(trials):
    t = make_trajectory(policy,0)
    total_moves += len(t) - 1 # the last entry is the terminal state, not a move
  return round(total_moves / trials, 2)

In [None]:
#value = assess(P,2000)
print("Average moves per game: " + str(assess(P,2000)))

Average moves per game: 12.67


Note: Before we analyze the results, it is essential to remember that each time this notebook is run, it is likely there will be a slightly different result.


The policy above takes an average of 12.67 moves to complete a game of Chutes and Ladders in the run shown, which is relatively close to the optimal value of 8 moves. The Monte Carlo approach discovered some interesting moves. For example, on the first move, our policy tells us to roll the green die. This makes sense because the green die has a $\frac{1}{2}$ chance of rolling a one, which lands on square 1 and takes the ladder all the way to 38. There are some peculiar decisions as well. For example, on square 92, it tells us to roll the green die. Here the $\frac{1}{2}$ chance of rolling a one is a risk: we would land on 93, hit a chute, and slide back to 73. But there is also a $\frac{1}{2}$ chance of rolling a five, which lands on square 97. From there, our policy tells us to roll the blue die, which always takes us to the goal, so the game ends in two moves. That is an interesting choice to make, given the risk.
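
One way to sanity-check figures like these, assuming the same board and dice, is to compute the expected number of moves exactly with value iteration. This is a dynamic programming technique, not part of the Monte Carlo method used in this notebook, and the sketch below is only an illustration. `next_square` and `value_iteration` are hypothetical names, and the board is re-declared so the block runs on its own.

```python
import numpy as np

LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 48: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}
DICE = [
    [2, 2, 2, 2, 6, 6],  # 0 = red
    [3, 3, 3, 3, 3, 3],  # 1 = blue
    [0, 0, 4, 4, 4, 4],  # 2 = black
    [1, 1, 1, 5, 5, 5],  # 3 = green
]

def next_square(state, roll):
    """Move forward by roll, cap at 100, then apply any ladder or chute."""
    nxt = min(state + roll, 100)
    nxt = LADDERS.get(nxt, nxt)
    return CHUTES.get(nxt, nxt)

def value_iteration(tol=1e-9, max_sweeps=10_000):
    """Return (V, policy): expected moves to the goal and the best die per square."""
    V = np.zeros(101)
    Q = np.zeros((101, 4))
    for _ in range(max_sweeps):
        Q = np.zeros((101, 4))
        for s in range(100):  # square 100 is terminal and keeps a value of 0
            for a, faces in enumerate(DICE):
                # one move now, plus the expected moves from wherever we land
                Q[s, a] = 1 + sum(V[next_square(s, f)] for f in faces) / len(faces)
        new_V = Q.min(axis=1)
        converged = np.max(np.abs(new_V - V)) < tol
        V = new_V
        if converged:
            break
    return V, Q.argmin(axis=1)

V, policy = value_iteration()
print("Expected moves from square 0:", round(V[0], 2))
print("Best die on square 92:", policy[92], "and on square 97:", policy[97])
```

Comparing the printed values with the Monte Carlo results gives a rough idea of how far the learned policy is from the best achievable one on this board.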


Although the number of moves is higher than some of the other models, the average number of moves needed to reach the end is by no means absurdly high. Monte Carlo remains an important topic in Reinforcement Learning because it is a useful approach to take on its own and is crucial in the implementation of the Q-Learning and SARSA algorithms discussed below.