# Reinforcement Learning Solution to the Towers of Hanoi Puzzle

Damian Armijo

In [4]:
import neuralnetworks as nnQ
import random
import numpy as np

In [103]:
from statistics import mean 

In [303]:
from copy import copy
from copy import deepcopy

# Overview
This Jupyter notebook first presents and describes the functions used to apply the Q-learning algorithm to the Towers of Hanoi problem. It then tests each function to show that it works. Next, an investigation section explores which inputs work best for this Q-learning algorithm. Finally, a discussion section covers the results of that investigation.

# Functions
The following functions implement the Q-learning algorithm for the Towers of Hanoi problem.
Each function is preceded by a brief description of what it does and how it relates to the Q-learning algorithm.

The following helper function converts a state and a move into tuples. This is necessary because lists are not hashable and so cannot be used as keys in the dictionary Q. It returns the "tuplified" (state, move) pair.

In [309]:
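def stateMoveTuple(state, move):
    # Convert each column (a list) to a tuple so the whole state is hashable
    tupState = tuple(tuple(col) for col in state)
    # Pair the hashable state with the hashable move, for use as a key in Q
    return (tupState, tuple(move))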
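As a small illustration of why the conversion matters, the sketch below shows that a key built from lists is rejected by a dictionary while its tuple form is accepted. The helper name `to_key` is hypothetical and only mirrors what stateMoveTuple does.

```python
def to_key(state, move):
    # Hypothetical stand-in for stateMoveTuple: nested lists -> nested tuples
    return (tuple(tuple(col) for col in state), tuple(move))

state = [[1, 2, 3], [], []]
move = [1, 2]

Q = {}
try:
    Q[(state, move)] = 0           # a tuple that contains lists is still unhashable
    list_key_works = True
except TypeError:
    list_key_works = False

Q[to_key(state, move)] = 0         # nested tuples are hashable
print(list_key_works)              # False
print(to_key(state, move) in Q)    # True
```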

The validMoves function takes a state and returns the moves that can be made from it. For each column it looks at the top disk and checks whether that disk can move to either of the other two columns: a move is valid when the destination column is empty or its top disk is larger.

In [13]:
def validMoves(state):
    moves = []
    for x in range(1,4):
        for y in range(1,4):
            # Skip moves onto the same column and moves from an empty column
            if x != y and len(state[x-1]) > 0:
                topx = state[x-1][0]  # top disk of the source column
                if len(state[y-1]) == 0:
                    # Any disk may move onto an empty column
                    moves.append([x,y])
                elif state[y-1][0] > topx:
                    # A disk may only rest on a larger disk
                    moves.append([x,y])

    return moves
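To make the rule concrete, here is a minimal self-contained sketch of the same legality check applied to the start state and to a mid-game state. The function name `legal_moves` is an assumption for illustration and is not the notebook's own function.

```python
def legal_moves(state):
    # state: three columns, each a list with the top disk first
    moves = []
    for src in range(3):
        for dst in range(3):
            if src == dst or not state[src]:
                continue
            # Legal if the destination is empty or its top disk is larger
            if not state[dst] or state[dst][0] > state[src][0]:
                moves.append([src + 1, dst + 1])  # columns are numbered from 1
    return moves

print(legal_moves([[1, 2, 3], [], []]))   # [[1, 2], [1, 3]]
print(legal_moves([[3], [2], [1]]))       # [[2, 1], [3, 1], [3, 2]]
```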

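The printState function prints the given state in a readable format, giving a visual of the columns and the "rings" on them. The number of rows printed comes from len(max(state)), which is the length of the lexicographically largest column rather than of the tallest one, so a taller column can be cut short in the printout.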

In [380]:
def printState(state):
    temp = 0
    col1Len = 0
    col2Len = 0
    col3Len = 0
    big = len(max(state))
    while temp < len(max(state)):
        holder1,holder2,holder3 = " "," "," "
        
        if col1Len< len(state[0]) and col1Len !=big and col1Len == temp:
            holder1 = state[0][temp]
            col1Len = col1Len + 1
        if col2Len < len(state[1]) and col2Len == temp:
            holder2 = state[1][temp]
            col2Len = col2Len + 1
        if col3Len < len(state[2])and col3Len == temp:
            holder3 = state[2][temp]
            col3Len = col3Len + 1
            
        print(holder1,holder2,holder3)
        temp = temp+1
    print("------")

The makeMove function returns a copy of the given state with the given move applied; the original state is left unchanged.

In [302]:
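def makeMove(state, move):
    # Work on a deep copy so the caller's state is not modified
    state2 = deepcopy(state)
    # Take the top disk off the source column (columns are numbered from 1)
    disk = state2[move[0]-1].pop(0)
    # Place it on top of the destination column
    state2[move[1]-1].insert(0,disk)
    return state2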
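The choice of deepcopy over copy matters because a state is a list of lists. The sketch below, which uses only the standard library, shows that a shallow copy still shares the inner column lists with the original, whereas a deep copy does not.

```python
from copy import copy, deepcopy

# Shallow copy: the outer list is new, but the columns are shared
state = [[1, 2, 3], [], []]
shallow = copy(state)
shallow[0].pop(0)
after_shallow = state          # the original lost its top disk too
print(after_shallow)           # [[2, 3], [], []]

# Deep copy: the columns are copied as well
state = [[1, 2, 3], [], []]
deep = deepcopy(state)
deep[0].pop(0)
after_deep = state             # the original is untouched
print(after_deep)              # [[1, 2, 3], [], []]
```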

The trainQ function creates and trains the Q dictionary, which maps (state, action) pairs to values. It runs the given number of repetitions and updates Q after each move chosen by epsilonFindMoves (the epsilon-greedy policy). It also decays the epsilon passed to epsilonFindMoves at the start of every repetition.

In [345]:
def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF):
    Q = {}
    
    state = [[1,2,3],[],[]]
    epsilon = 1
    step_count = []
    
    for x in range(nRepetitions):
        
        epsilon = epsilonDecayFactor * epsilon
        step = 0
        goal_state = [[],[],[1,2,3]]
        done = False
        
        while not done:
            
            step = step + 1
            # Epsilon-greedy choice of the next move
            next_move = epsilonFindMoves(Q, state, epsilon, validMovesF)
            # Use the move function passed in by the caller
            next_state = makeMoveF(state,next_move) 
            
            # Unseen (state, move) pairs start at 0
            if stateMoveTuple(state,next_move) not in Q:
                Q[stateMoveTuple(state,next_move)] = 0
                
            # A move that reaches the goal is worth exactly 1 step
            if next_state == goal_state:
                Q[stateMoveTuple(state,next_move)] = 1
                done = True
                
            # Temporal-difference update of the previous pair (each move costs 1 step)
            if step > 1:
                Q[stateMoveTuple(stateOld,moveOld)] += learningRate *(1+Q[stateMoveTuple(state,next_move)]-
                                                        Q[stateMoveTuple(stateOld,moveOld)])
            stateOld, moveOld = state, next_move
            state = next_state
        state = [[1,2,3],[],[]]
        step_count.append(step)
    return Q ,step_count
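The update inside the loop can be followed by hand. The sketch below repeats it for a single (state, move) pair whose successor is the goal move, which trainQ fixes at 1. The function name `td_update` and the starting value of 0 are assumptions for illustration.

```python
def td_update(q_old, q_next, learning_rate):
    # Same form as the update in trainQ: each move costs 1 step
    return q_old + learning_rate * (1 + q_next - q_old)

q = 0.0
history = []
for _ in range(3):
    q = td_update(q, q_next=1, learning_rate=0.5)
    history.append(q)

print(history)   # [1.0, 1.5, 1.75]
```

The value moves halfway toward 1 + 1 = 2 on every update, which is the number of steps left from a state that is two moves away from the goal.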

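The testQ function takes the trained Q dictionary and always picks the greedy move toward the goal (epsilon is 0). It appends each state it visits to a list and returns that list once the goal is reached or maxSteps is exceeded. Each state is recorded before its move is made, so the goal state itself is not included in the returned path.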

In [355]:
def testQ(Q, maxSteps, validMovesF, makeMovesF):
    
    state = [[1,2,3],[],[]]
    epsilon = 0
    path = []
    #path.append(state)
    
    step = 0
    goal_state = [[],[],[1,2,3]]
    done = False

    while done != True:
        step = step+1
        path.append(state)
        next_move = epsilonFindMoves(Q, state, epsilon, validMovesF)

        # Use the move function passed in by the caller
        next_state = makeMovesF(state,next_move) 
        

        if next_state == goal_state:
            done = True
        if step > maxSteps:
            done = True
        state = next_state
    return path

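The epsilonFindMoves(...) function implements the epsilon-greedy choice. With probability epsilonRate it returns a random valid move; otherwise it returns the valid move with the lowest Q value. The lowest value is the best choice here because Q estimates the number of steps remaining to the goal.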

In [301]:
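def epsilonFindMoves(Q, state, epsilonRate, validMovesF):
    # Use the valid-moves function passed in by the caller
    validMoveList = validMovesF(state)
    if np.random.uniform()<epsilonRate:
        # Explore: pick a valid move uniformly at random
        return validMoveList[np.random.choice(len(validMoveList))] 
    else:
        # Exploit: pick the move with the fewest estimated steps to goal (unseen pairs count as 0)
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMoveList])
        return validMoveList[np.argmin(Qs)]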
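How often the random branch is taken depends on how fast epsilon shrinks. Since trainQ starts epsilon at 1 and multiplies it by the decay factor at the start of each repetition, the value used in repetition n is the decay factor raised to the power n. The sketch below tabulates this; the function name `epsilon_schedule` is hypothetical.

```python
def epsilon_schedule(decay, n_repetitions):
    # epsilon starts at 1 and is multiplied by decay at the start of each repetition
    epsilon = 1.0
    values = []
    for _ in range(n_repetitions):
        epsilon *= decay
        values.append(epsilon)
    return values

for decay in (0.99, 0.7, 0.01):
    eps = epsilon_schedule(decay, 20)
    # epsilon used in repetitions 1, 10 and 20
    print(decay, round(eps[0], 4), round(eps[9], 4), round(eps[19], 4))
```

With a decay factor of 0.7, epsilon is already below 0.03 by the tenth repetition, while with 0.99 it is still above 0.8 after twenty, so that setting keeps exploring for much longer.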

# Testing

In [378]:
state = [[1,2,3],[],[]]
move = [1,2]

In [377]:
#testing stateMoveTuple(state,move)
print(stateMoveTuple(state,move))

(((1, 2, 3), (), ()), (1, 2))


In [376]:
#testing validMoves(state)
print(validMoves(state))

[[1, 2], [1, 3]]


In [383]:
#testing printState(state)
printState(state)

1    
2    
3    
------


In [384]:
#testing makeMove(state,move)
print(makeMove(state,move))

[[2, 3], [1], []]


In [399]:
#testing trainQ(...)
Q, stepsToGoal = trainQ(50, 0.5, 0.7, validMoves, makeMove)
print('State Actions in Q:' ,len(Q))
print("Progression of steps to goal:")
print(np.array(stepsToGoal))
print('Mean of steps:',mean(stepsToGoal))

State Actions in Q: 76
Progression of steps to goal:
[ 95  37 237  19  23  45  26  23   7  31  14  44   8  26  10   9   8  14
   7  32  11   7  16   7   7   7   7   7   9   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7]
Mean of steps: 18.8


In [410]:
#testing testQ(...)
Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove)
path = testQ(Q, 20, validMoves, makeMove)
print("Path for trained Q to Goal:")
for s in path:
    printState(s)

Path for trained Q to Goal:
1    
2    
3    
------
2   1
3    
------
3 2 1
------
3 1  
------
  1 3
------
1 2 3
------
1   2
    3
------


In [403]:
path

[[[1, 2, 3], [], []],
 [[2, 3], [], [1]],
 [[3], [2], [1]],
 [[3], [1, 2], []],
 [[], [1, 2], [3]],
 [[1], [2], [3]],
 [[1], [], [2, 3]]]

## Further investigation

### Control

In [445]:
Q, stepsToGoal = trainQ(500, 0.5, 0.7, validMoves, makeMove)
print('500 Repetitions, .5 learning rate, .7 decay rate')
print('State Actions in Q:' ,len(Q))
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))

500 Repetitions, .5 learning rate, .7 decay rate
State Actions in Q: 76
Progression of steps to goal:
[107  36 198  12  35  18  41  61  24  55   9  17  24   9   9   9  10  36
  38   7   7  18   7   7  11   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7]
Mean of steps: 8.26


### Repetitions

In [436]:
Q, stepsToGoal = trainQ(5, 0.5, 0.7, validMoves, makeMove)
print('5 Repetitions')
print('State Actions in Q:' ,len(Q))
print("Progression of steps to goal:")
print(np.array(stepsToGoal))
print('Mean of steps:',mean(stepsToGoal))

5 Repetitions
State Actions in Q: 76
Progression of steps to goal:
[ 87 119  51  44  15]
Mean of steps: 63.2


In [437]:
Q, stepsToGoal = trainQ(500, 0.5, 0.7, validMoves, makeMove)
print('500 Repetitions')
print('State Actions in Q:' ,len(Q))
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))

500 Repetitions
State Actions in Q: 76
Progression of steps to goal:
[ 51  74 108  55  23  50  15  24  50  18  54  10  16   8  27   7  28  33
   9  12   7   7   9  24  11  23   7   7   7   7   8   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7]
Mean of steps: 8.144


In [438]:
Q, stepsToGoal = trainQ(10000, 0.5, 0.7, validMoves, makeMove)
print('10000 Repetitions')
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))

10000 Repetitions
Progression of steps to goal:
[118  47 118  52  16  34  16  45  36  58  10   7  26   9  15  30  12  10
  15   9  15   7   7   8   7   7  21   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7]
Mean of steps: 7.0566


### Learning Rate

In [439]:
Q, stepsToGoal = trainQ(500, .99 , 0.7, validMoves, makeMove)
print('500 Repetitions, .99 learning rate')
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))
Q, stepsToGoal = trainQ(500, .01 , 0.7, validMoves, makeMove)
print('500 Repetitions, .01 learning rate')
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))

500 Repetitions, .99 learning rate
Progression of steps to goal:
[ 24  91 125  21  37  37  22  13  16   7  11   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7]
Mean of steps: 7.654
500 Repetitions, .01 learning rate
Progression of steps to goal:
[  28  208   90  751  407  561  325  238  903  756  137 3281  188  108
   52  125   65   79  123   75  107   57  167   27   97  104   49  113
   24  118   49   90   50  116   69   63  147   22   80   55   78   46
   86  124   27  121   13   70   70   57   62   45   88   50   71  118
   16   73   84   65   42   81   35   46  118   21  122   38   35   46
   86   22   32  106   18   78   59   40   46   63   62   47   93   30
   41   69   45 

### Epsilon decay rate

In [440]:
Q, stepsToGoal = trainQ(500, 0.5, 0.99, validMoves, makeMove)
print('500 Repetitions, .99 decay')
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))
Q, stepsToGoal = trainQ(500, 0.5, 0.01, validMoves, makeMove)
print('500 Repetitions, .01 decay')
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))

500 Repetitions, .99 decay
Progression of steps to goal:
[ 99 137  68  92 209 253 243 203  66  34  36 105 278  23  53  26  67  36
  81  26  70  54  10  58  26  21  46  96  12  31  49  18  40  19  15  55
  21  41  15  13  29  24  34  23  27  20  23  15   9  28  16  44  11  15
  11  14   8  33   9  13  11  19  39  16  19  12  17  22  11  36  18   9
  23  12  10  20  13  27  21  16  29  24  24  21  13  11  13  30  16  10
  16  10  28  10   8  14  18  22   8  12]
Mean of steps: 14.328
500 Repetitions, .01 decay
Progression of steps to goal:
[ 66  42  86 116  26  26  21  40  27  29  12  22  45  16  15  11  10  36
  19   7   7  10  10  42  10   8   7   7   7   7   7  10   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
   7   7   7   7   7   7   7   7   7   7]
Mean of steps: 8.16


### Best of 3 options

In [442]:
Q, stepsToGoal = trainQ(10000, 0.99, 0.01, validMoves, makeMove)
print('10000 Repetitions, .99 learning rate, .01 decay')
print("Progression of steps to goal:")
print(np.array(stepsToGoal[:100]))
print('Mean of steps:',mean(stepsToGoal))

10000 Repetitions, .99 learning rate, .01 decay
Progression of steps to goal:
[66 42 62 36 80 47 11 15 10 24  7 14  7  7  7  7  7  7  7  7  7  7  7  7
  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
  7  7  7  7]
Mean of steps: 7.033


# Discussion of Results

Testing the trainQ function with different inputs turned up some interesting results. Varying the number of repetitions showed a clear benefit to doing more of them. With only 5 repetitions, every repetition took many more than 7 steps to reach the goal, whereas the 500-repetition runs always settled at 7 steps. However, a much larger number of repetitions (10000) did not make the algorithm settle at 7 steps any sooner.
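The floor of 7 steps seen in every run matches the known minimum for three disks, 2^3 - 1 = 7. A short recursive solver confirms the count. This is the standard textbook recursion, not part of the notebook's Q-learning code, and the function name `hanoi` is chosen here for illustration.

```python
def hanoi(n, source=1, target=3, spare=2):
    # Move n disks from source to target, returning the list of [from, to] moves
    if n == 0:
        return []
    return (hanoi(n - 1, source, spare, target)    # clear the n-1 smaller disks
            + [[source, target]]                   # move the largest disk
            + hanoi(n - 1, spare, target, source)) # put the smaller disks back on top

moves = hanoi(3)
print(len(moves))   # 7
print(moves)        # [[1, 3], [1, 2], [3, 2], [1, 3], [2, 1], [2, 3], [1, 3]]
```

The first six moves are the same transitions as in the path returned by testQ in the Testing section.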

I also tested two learning rates, .99 and .01. The learning rate had a big impact on the "learning". With .99 the step count dropped to 7 moves very quickly, faster than the control. The .01 learning rate did the opposite: it reached 7 moves only a couple of times, at random.

I then tested two epsilon decay rates, .99 and .01. With a decay rate of .01, the step count reached 7 moves at about the same rate as the control, whereas with .99 it took a while to get down to even 8 moves. Compared with the control, changing the decay rate seemed able to hurt more than it could help.

Finally, I combined the best setting from each test, and the result was noticeably better than the control. The control averaged 8.26 steps and took about 25 repetitions to settle at a constant 7 moves. The combined best settings averaged 7.033 steps and took only 12 repetitions to settle at 7 moves. Part of the gap between the averages comes from run length: the best run used 10000 repetitions against the control's 500, so its early exploratory repetitions weigh less in the mean.
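The "repetitions to settle" figures above were read off the printed arrays. A small helper can make the same measurement reproducible. The sketch below is one possible definition (the number of repetitions before the count stays at 7 for the rest of the list) applied to the first values printed for the two runs; the name `repetitions_to_settle` is hypothetical.

```python
def repetitions_to_settle(step_counts, optimum=7):
    # Number of repetitions before every later count equals optimum
    last_bad = -1
    for i, steps in enumerate(step_counts):
        if steps != optimum:
            last_bad = i
    return last_bad + 1

# First 30 step counts printed for the control run
control = [107, 36, 198, 12, 35, 18, 41, 61, 24, 55, 9, 17, 24, 9, 9,
           9, 10, 36, 38, 7, 7, 18, 7, 7, 11, 7, 7, 7, 7, 7]
# First 20 step counts printed for the "best of 3 options" run
best = [66, 42, 62, 36, 80, 47, 11, 15, 10, 24, 7, 14, 7, 7, 7, 7, 7, 7, 7, 7]

print(repetitions_to_settle(control))   # 25
print(repetitions_to_settle(best))      # 12
```

Only the printed prefix of each run is used here, so the result assumes no later repetition strays from 7 steps.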