# Chutes and Ladders using Q-Learning
This notebook uses Q-Learning to solve a modified version of the Chutes and Ladders game.

The game board is shown below. Players start on Square 0 (outside the board) and move towards the goal space (State 100). Landing at the bottom of a ladder moves the player to the top, while landing at the top of a chute moves the player to the bottom square.


<img src="https://drive.google.com/uc?export=view&id=16k2EflsluUXCPVWVgL-vyv5-IzOqWZvW" alt="Drawing" width="500"/>


At each turn, the player may choose one of four dice (Efron dice):
- Blue: 3,3,3,3,3,3
- Black: 4,4,4,4,0,0
- Red: 6,6,2,2,2,2
- Green: 5,5,5,1,1,1

The **purpose** of this code is to determine which die to select at each turn so as to minimize the number of moves it takes to reach the goal state.
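As a quick point of reference, the sketch below computes the average roll of each die from the face values listed above. The names `DICE` and `expected_roll` are introduced here for illustration only and are not part of the notebook's code.

```python
from statistics import mean

# Faces of the four dice, copied from the list above
DICE = {
    "blue": [3, 3, 3, 3, 3, 3],
    "black": [4, 4, 4, 4, 0, 0],
    "red": [6, 6, 2, 2, 2, 2],
    "green": [5, 5, 5, 1, 1, 1],
}

# Average number of squares advanced per roll, ignoring chutes and ladders
expected_roll = {color: mean(faces) for color, faces in DICE.items()}

for color, value in expected_roll.items():
    print(f"{color}: {value:.3f}")
```

Red has the highest average roll (about 3.33) and black the lowest (about 2.67), but the best die on a given square also depends on which chutes and ladders are within reach, which is why the choice is learned per square.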

In [None]:
# import statements
import random
import numpy as np
import matplotlib.pyplot as plt


## State Transition Code

In [None]:
def nextState (state,roll):
    '''
    This function transitions from the current state and current dice roll to the next state.
    INPUTS:
        state is the current state you are in (0 to 100)
        roll is the number showing on the die (0 to 6; the black die can roll 0)
    RETURN VALUE:
    this function returns the next state integer
    '''
    # we create a dictionary for the ladders and chutes.  The key is the start state of the chute/ladder
    # and the value is the ending state.
    ladders = {1:38,4:14,9:31,21:42,28:84,36:44,51:67,71:91,80:100}
    chutes = {16:6,48:26,49:11,56:53,62:19,64:60,87:24,93:73,95:75,98:78}

    next_state = state + roll
    if next_state > 100:
        next_state = 100
    # now check for ladders
    if next_state in ladders:
        next_state = ladders[next_state]
    # now check for chutes
    if next_state in chutes:
        next_state = chutes[next_state]

    return next_state

def roll (dice_color):
    '''
    This function randomly rolls one of the four Efron dice.
    INPUT:
    dice_color should be among "red","blue","black", or "green"
    OUTPUT:
    an integer randomly selected from one of the dice
    '''

    if dice_color == 'red':
        return random.choice([2,2,2,2,6,6])
    if dice_color == 'blue':
        return 3
    if dice_color == 'black':
        return random.choice([0,0,4,4,4,4])
    if dice_color == 'green':
        return random.choice([1,1,1,5,5,5])
    # for invalid input
    return None
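
To see how a die choice translates into board positions, the following sketch enumerates the exact distribution over next squares for one die from a given square. It is a self-contained illustration: `next_state` re-implements the rule from `nextState` above with a single merged lookup table (equivalent here because no ladder ends on the top of a chute), and `transition_probs` is a hypothetical helper that is not part of the notebook.

```python
from collections import Counter
from fractions import Fraction

LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 48: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}
JUMPS = {**LADDERS, **CHUTES}  # merged table of all chute and ladder jumps


def next_state(state, roll):
    """Advance by `roll`, cap at square 100, then follow any chute or ladder."""
    landed = min(state + roll, 100)
    return JUMPS.get(landed, landed)


def transition_probs(state, faces):
    """Return {next_square: probability} for rolling a die with the given faces."""
    counts = Counter(next_state(state, face) for face in faces)
    return {square: Fraction(count, len(faces)) for square, count in counts.items()}


# Green die from the start: a 1 lands on the ladder at square 1, a 5 lands on square 5
print(transition_probs(0, [5, 5, 5, 1, 1, 1]))
```

From square 0, the green die reaches square 38 (via the ladder on square 1) or square 5, each with probability 1/2.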



# Q-Learning Approach

Below, we implement a Q-Learning model that plays our modified version of Chutes and Ladders. Above each cell is a brief description of what its code does. We train the model over ten trials of 1,000 games each, updating the policy at the conclusion of each trial. Afterward, we test the model by playing 2,000 games using only the learned policy, in order to get an accurate measure of the average number of moves it takes to complete a game.

The `egreedy` function returns the action our model takes on each turn: either the move our policy suggests or a random move, depending on a random draw. With `epsilon` set to 0.1, as in this notebook, there is a 10% chance of a random move being taken.

In [None]:
def egreedy(epsilon, s, policy, debug):
    a = 0
    epsChoice = (random.random() <= epsilon) # If true, will take a random action instead of best action
    if epsChoice: # random action
        a = random.randint(0,3)
        if debug:
            print("Performing a random action, a =", a)
    else: # pick best action based on Q
        a = policy[s]
        if debug:
            print("Performing best known action, a =", a)
    return a
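
One consequence of drawing the random action from all four dice is that the random branch sometimes picks the policy's own action. The sketch below estimates how often the greedy action ends up being chosen; `egreedy_action` is a simplified stand-in for `egreedy` written for this illustration (it takes the greedy action directly and uses its own seeded generator), not the notebook's function.

```python
import random


def egreedy_action(epsilon, greedy_action, rng, num_actions=4):
    """Return a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() <= epsilon:
        return rng.randint(0, num_actions - 1)  # may coincide with the greedy action
    return greedy_action


rng = random.Random(0)  # fixed seed so the estimate is reproducible
trials = 100_000
greedy_action = 2
hits = sum(egreedy_action(0.1, greedy_action, rng) == greedy_action for _ in range(trials))
greedy_fraction = hits / trials
print(f"greedy action chosen in {greedy_fraction:.3f} of draws")
```

With `epsilon = 0.1` and four actions, the greedy action is expected in roughly 1 - 0.1 + 0.1/4 = 0.925 of the draws, so the effective exploration rate is slightly below 10%.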

The `updatePolicy` function updates our policy by finding the minimum-value move at each state using the model's Q-matrix. It is run after each training trial.

In [None]:
def updatePolicy(Q):
    """
    Derives the greedy policy from Q
    Inputs: Q = Q-matrix
    Outputs: V = minimum Q-value at each state, P = index of the die achieving it (the new policy)
    """

    V = np.min(Q, axis=1)
    P = np.argmin(Q, axis=1)
    return V,P
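
The effect of `np.min` and `np.argmin` along `axis=1` can be seen on a small example. The values in `Q_example` are made up for illustration and cover only three squares.

```python
import numpy as np

# Hypothetical Q-values for three squares and four dice (lower = fewer expected moves)
Q_example = np.array([
    [3.0, 1.0, 2.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],  # a square that was never updated: all entries still zero
    [4.0, 4.0, 2.0, 9.0],
])

V_example = np.min(Q_example, axis=1)     # smallest value in each row
P_example = np.argmin(Q_example, axis=1)  # column index of that smallest value

print(V_example)
print(P_example)
```

`np.argmin` returns the first index when several entries tie, so a row that is still all zeros maps to action 0 (the red die under the notebook's `aOpts` ordering).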

The ``QLearning`` function plays through one game of modified Chutes and Ladders, adjusting the Q-matrix after each move. Every move incurs a reward of 1, which acts as a cost, so each Q-value estimates the number of moves remaining and lower values are better. To encourage exploration, there is a 10% chance of the model taking a random action instead of the action given by the current policy. The model learns at a rate of 0.1, giving the result of each move a slight influence over the value of each state-action pair.

This function will only ever be run from inside of the ``QLTrain`` function.

In [None]:
def QLearning(Q, policy, aOpts, alpha, epsilon, debug):
    """
    Plays 1 game of Chutes and Ladders to train the model
    Inputs: Q = Q-matrix
            policy = the current policy of the model
            aOpts = a list of action choices
            alpha = alpha value
            epsilon = epsilon value
            debug = True to print out debug messages
    Outputs: total reward for the game (1 per move, i.e. the number of moves taken); Q is updated in place
    """

    s = 0 # state, start @ square 0
    total_reward = 0
    reward = 0
    mov_count = 0
    while s != 100:
        if debug:
            print("Move", str(mov_count) + ":")
        a = egreedy(epsilon, s, policy, debug)

        # Get next state
        rollVal = roll(aOpts[a])
        sP = nextState(s, rollVal) # s'
        reward = 1

        total_reward += reward

        # Debug stuff
        if debug:
            print("Result: Rolled", aOpts[a], "die with result", rollVal, "moving from square", s, "to square", (s + rollVal))
            if (s + rollVal) == sP:
                print("No chutes or ladders encountered, final square = " + str(sP) + ", reward =", reward)
            elif (s + rollVal) < sP:
                print("Ladder encountered, final square = " + str(sP) + ", reward =", reward)
            else:
                print("Chute encountered, final square = " + str(sP) + ", reward =", reward)
        prevQ = Q[s][a]
        prevS = s

        # Q update calculations
        aP = policy[sP] # a'
        TDerror = reward + Q[sP][aP] - Q[s][a]
        Q[s][a] = Q[s][a] + alpha*TDerror
        s = sP

        if debug:
            print("Previous Q-value at state", prevS, "choosing the", aOpts[a], "die =", prevQ)
            print("New Q-value at state", prevS, "choosing the", aOpts[a], "die =", Q[prevS][a])
            print()
        mov_count += 1

    # Terminal-state update: no further cost is incurred after reaching square 100.
    # Q[100] starts at 0, so this update leaves it at 0.
    reward = 0
    total_reward += reward
    TDerror = reward + 0 - Q[s][a]
    Q[s][a] = Q[s][a] + alpha * TDerror

    return total_reward
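
Written out, the update applied inside the loop above has the following form. The notation is an assumption made here for readability: $c$ stands for the per-move cost (`reward = 1` in the code), $\pi$ for the `policy` array, and $s'$ for `sP`.

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ c + Q\big(s', \pi(s')\big) - Q(s,a) \right], \qquad c = 1
$$

For example, with $\alpha = 0.1$, $Q(s,a) = 0$ and $Q(s', \pi(s')) = 10$, the new value is $0 + 0.1\,(1 + 10 - 0) = 1.1$. Note that the bootstrap term uses the die stored in `policy`, which is only refreshed between trials, rather than the minimum over the current row of `Q`, and that no discount factor is applied.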

The ``QLTrain`` function is used to train the Q-Learning model. To do this it plays ``n`` games of modified Chutes and Ladders.

In [None]:
def QLTrain(Q, policy, aOpts, n, alpha, epsilon, debug=False):
    """
    Trains a model using QLearning
    Inputs: Q = Q-Matrix
            policy = policy of the model
            aOpts = a list of action choices
            n = number of games to play
            alpha = alpha value
            epsilon = epsilon value
            debug = True to print debug messages (False by default)
    Outputs: average total reward per game (i.e. average number of moves); Q is updated in place
    """

    R = 0   # total reward
    for i in range(n):
        R += QLearning(Q, policy, aOpts, alpha, epsilon, debug)
    return R / n

The ``QLPlay`` function plays a game of modified Chutes and Ladders by following the policy, without updating the Q-matrix. It is run after training the model to illustrate results. ``PrintGame`` prints a formatted log of the game.

In [None]:
def QLPlay(aOpts, policy, debug=False, printLog=False):
    """
    Plays a game of Modified Chutes and Ladders without updating Q-matrix.
    Prints a log of actions (if debug is True) and the path taken through the game.
    Inputs: aOpts = a list of action choices
            policy = array giving the index of the die to choose at each square
            debug = True to print debug messages
            printLog = True to print a formatted log of the game
    Outputs: Returns the number of moves; prints a log of the game to the console if printLog is True.
    """

    if debug:
        print("Game Debug Log:")
    sLog = []
    aLog = []
    s = 0
    mov_count = 0
    while s != 100:
        a = policy[s]
        sLog.append(s)
        if debug:
            print("Move " + str(mov_count) + ":")
        aLog.append(a)
        rollVal = roll(aOpts[a])
        sP = nextState(s, rollVal) # s'

        # Debug stuff
        if debug:
            # Note: Q is not a parameter here; it is read from the notebook's global scope
            print("Selected the", aOpts[a], "die (Q-value = " + str(Q[s][a]) + "). Rolled a " + str(rollVal) + ".")
            if (s + rollVal) == sP:
                print("Moving from square", s, "to square " + str(s + rollVal) + ".")
            elif (s + rollVal) < sP:
                print("There was a ladder on square " + str((s+rollVal)) + "! It went to square " + str(sP) + "!")
            else:
                print("There was a chute on square " + str((s + rollVal)) + "! It went to square " + str(sP) + "!")
            if (s < sP):
                print("This action resulted in a net movement of", (sP - s), "squares forward.")
            elif (s == sP):
                print("This action resulted in no movement.")
            else:
                print("This action resulted in a net movement of", (s - sP), "squares backward.")

        s = sP
        mov_count += 1

    if debug:
        print()
        print()

    if printLog:
        PrintGame(aOpts, sLog, aLog) # Game Result Output
    return mov_count

In [None]:
def PrintGame(aOpts, sLog, aLog):
    """
    Prints a log of a game described by sLog and aLog
    Inputs: aOpts = a list of action choices
            sLog = log of all states in order
            aLog = log of action choices in order
    Outputs: Returns nothing, prints to console
    """

    print("Log:")
    for i in range(len(sLog)):
        if i == len(sLog)-1:
            print("Move " + str(i + 1) + ": Chose " + aOpts[aLog[i]] + " die, square " + str(sLog[i]) + " -> square 100")
        else:
            print("Move " + str(i + 1) + ": Chose " + aOpts[aLog[i]] + " die, square " + str(sLog[i]) + " -> square " + str(sLog[i+1]))

The code below trains our model as described at the start of this section. It prints the performance (the average number of moves to finish the game) for each trial of training, followed by the final Q-matrix, list of values (V list), and policy after the final trial.

In [None]:
m = 10
n = 1000

aOpts = ["red", "blue", "black", "green"]
policy = np.zeros(101, dtype=int)
Q = np.full((101,4), 0.0, dtype=float) # Initialize every Q-value to 0
epsilon = 0.1
alpha = 0.1

for i in range(m):
    print("\n*** Trial {0:d} ***".format(i))
    R = QLTrain(Q, policy, aOpts, n, alpha, epsilon, debug=False)
    V,policy = updatePolicy(Q)
    print("Performance:", R)

print()
print()

print("FINAL Q MATRIX")
print("=============================")
print("Note: Rows of 0s in the matrix are a result of those squares being the beginning of a chute or ladder path.")
print()
print(Q)
print()
print()

print("FINAL V LIST")
print("=============================")
print("Shows the value of the optimal action at each square.")
print()
print(V)
print()
print()

print("FINAL POLICY")
print("=============================")
print("Each item is the die to select at the corresponding square (starting from 0 up to 100).")
print("Dice key: 0 = Red, 1 = Blue, 2 = Black, 3 = Green")
print()
print(policy)
print()
print()


*** Trial 0 ***
Performance: 29.675

*** Trial 1 ***
Performance: 21.535

*** Trial 2 ***
Performance: 63.863

*** Trial 3 ***
Performance: 12.494

*** Trial 4 ***
Performance: 12.858

*** Trial 5 ***
Performance: 12.479

*** Trial 6 ***
Performance: 11.972

*** Trial 7 ***
Performance: 12.191

*** Trial 8 ***
Performance: 12.122

*** Trial 9 ***
Performance: 12.104


FINAL Q MATRIX
Note: Rows of 0s in the matrix are a result of those squares being the beginning of a chute or ladder path.

[[ 12.99832244  15.75437403  10.99959319  10.38577398]
 [  0.           0.           0.           0.        ]
 [ 11.70915444  67.86501892  90.17178692 130.69128513]
 [ 97.04464436  31.30097125  14.82858201  43.08457474]
 [  0.           0.           0.           0.        ]
 [ 13.02212805  12.95276793   9.98521397  13.50169473]
 [ 12.39788008  61.27333024  89.53651846  85.99924716]
 [ 75.50555649  13.2750994   81.8679767   61.24875124]
 [ 18.16586695  11.85195546  15.17342498  77.39890758]
 [  0.   
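
The printed policy is an array of die indices. A small sketch of how such an array could be turned into a readable summary is shown below; `example_policy` holds made-up values for illustration only and is not the trained policy, and `die_by_square` and `usage` are names introduced here.

```python
from collections import Counter

aOpts = ["red", "blue", "black", "green"]  # same index order as in the notebook
example_policy = [3, 0, 0, 1, 0, 2, 0, 1]  # made-up die indices for squares 0..7

# Map each square to the name of the die the policy selects there
die_by_square = {square: aOpts[a] for square, a in enumerate(example_policy)}

# Count how often each die is selected across the squares
usage = Counter(die_by_square.values())

print(die_by_square[0])
print(usage.most_common())
```

Keep in mind that squares whose Q-row is still all zeros default to index 0, so a count of red dice over the full policy would include squares that can never be occupied.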

The `assess` function will conduct a series of non-learning trials and evaluate the value of the policy. In this case, that value would be the average number of moves it takes to complete the game across all trials.

In [None]:
def assess(policy, aOpts, numTrials):
    """
    Runs numTrials non-learning trials and evaluates the value of the policy.
    Inputs: policy = policy of the model
            aOpts = a list of action choices
            numTrials = number of non-learning trials to run
    Outputs: value of the policy (avg. moves to win)
    """
    totalMovesAllTrials = 0
    for i in range(numTrials):
        totalMovesAllTrials += QLPlay(aOpts, policy, debug=False, printLog=False)
    return float(totalMovesAllTrials)/float(numTrials)

Finally, we will use the `assess` function to evaluate the value of our policy after training.

In [None]:
print("Average moves per game:", assess(policy, aOpts, 2000))

Average moves per game: 10.453


This average varies somewhat between runs, but is generally about 10.5 moves per game. These results are quite promising: the average game of Chutes and Ladders takes 39.2 turns (using the same values as we have on these dice), and the best possible game takes 7 turns. Accounting for bad rolls, this is quite a good result.
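
Because the reported average varies between runs, it can help to report a standard error alongside it. The sketch below shows one way to do that; `mean_and_stderr` is a hypothetical helper and `example_moves` contains made-up move counts, not results from this notebook.

```python
import math
import statistics


def mean_and_stderr(samples):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr


# Made-up move counts standing in for the results of individual games
example_moves = [10, 12, 8, 10]
avg, se = mean_and_stderr(example_moves)
print(f"{avg:.2f} +/- {se:.2f} moves")
```

In the notebook, the samples would be the per-game move counts returned by `QLPlay`, collected in a list instead of being summed inside `assess`.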