# Reinforcement Learning Solution to the Towers of Hanoi Puzzle

For this assignment, we will use reinforcement learning to solve the [Towers of Hanoi](https://en.wikipedia.org/wiki/Tower_of_Hanoi) puzzle.  

To accomplish this, we will modify the code discussed in lecture for learning to play Tic-Tac-Toe so that it learns to solve the three-disk, three-peg Towers of Hanoi puzzle.

Steps required to do this include the following:

  - Represent the state, and convert it to a tuple so that it can be used as a key to the Q dictionary.
  - Make sure only valid moves are tried from each state.
  - Assign reinforcement of $-1$ to each move unless it is a move to the goal state, for which the reinforcement is $0$.  This represents the goal of finding the shortest path to the goal.
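
The first step matters because Python lists cannot be dictionary keys, while tuples can. The following is a minimal sketch of that conversion; `to_key` is a hypothetical helper name used only for illustration (the notebook's own version, `stateMoveTuple`, is defined in the Code section).

```python
def to_key(state, move):
    """Convert a list-based state and move into a nested, hashable tuple."""
    return (tuple(tuple(peg) for peg in state), tuple(move))


Q = {}
state = [[1, 2, 3], [], []]
move = [1, 2]

# Lists are unhashable, so a key that contains them raises TypeError.
try:
    Q[(state, move)] = -1
    list_key_worked = True
except TypeError:
    list_key_worked = False

# The nested-tuple form of the same state and move is a valid key.
key = to_key(state, move)
Q[key] = -1
print(list_key_worked)  # False
print(key)              # (((1, 2, 3), (), ()), (1, 2))
```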



## Introduction

The disks are named 1, 2, and 3, with 1 being the smallest disk and 3 being the largest. The set of disks on a peg is represented as a list of integers, ordered from the top disk to the bottom disk. The state is then a list of three such lists.

For example, the starting state with all disks being on the left peg would be `[[1, 2, 3], [], []]`.  After moving disk 1 to peg 2, we have `[[2, 3], [1], []]`.

To represent the move we just made, we can use a list of two peg numbers, like `[1, 2]`, representing a move of the top disk on peg 1 to peg 2.

## Requirements

The following functions have been used for this assignment:

   - `printState(state)`: prints the state in the form shown below
   - `validMoves(state)`: returns list of moves that are valid from `state`
   - `makeMove(state, move)`: returns new (copy of) state after move has been applied.
   - `trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF)`: trains the Q function for the given number of repetitions, decaying epsilon at the start of each repetition. Returns Q and a list or array of the number of steps taken to reach the goal in each repetition.
   - `testQ(Q, maxSteps, validMovesF, makeMoveF)`: without updating Q, uses Q to find the greedy action at each step until the goal is found. Returns the path of states.
   - `stateMoveTuple(state, move)`: returns tuple of state and move.  
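
To make the "decaying epsilon at the start of each repetition" part of `trainQ` concrete, here is a small sketch of the resulting schedule. It assumes, as in the `trainQ` code later in this notebook, that epsilon starts at 1 and is multiplied by the decay factor once per repetition; `epsilon_schedule` is a hypothetical helper, not one of the required functions.

```python
def epsilon_schedule(n_repetitions, decay_factor, start=1.0):
    """Return the epsilon value used in each repetition.

    Epsilon is multiplied by decay_factor at the start of every repetition,
    so repetition k (counting from 1) uses start * decay_factor ** k.
    """
    values = []
    epsilon = start
    for _ in range(n_repetitions):
        epsilon *= decay_factor
        values.append(epsilon)
    return values


schedule = epsilon_schedule(10, 0.7)
print([round(e, 3) for e in schedule])
```

With a decay factor of 0.7, the probability of a random move drops below 0.03 by the tenth repetition, so most later moves are greedy.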
    


# Code

In [1]:
import numpy as np
import random
from random import choice
import copy
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
def printState(state):
    peg=[['','',''],['','',''],['','','']]
    for i in range (0,(len(state))):
        if len(state[i])==0:
            peg[i][0]=' '
            peg[i][1]=' '
            peg[i][2]=' '
        if len(state[i])==1:
            peg[i][0]=' '
            peg[i][1]=' '
            peg[i][2]=state[i][0]
        if len(state[i])==2:
            peg[i][0]=' '
            peg[i][1]=state[i][0]
            peg[i][2]=state[i][1]
        if len(state[i])==3:
            peg[i][0]=state[i][0]
            peg[i][1]=state[i][1]
            peg[i][2]=state[i][2]
    
    for i in range (0,3):
        print (peg[0][i],peg[1][i],peg[2][i])
    print ('------')
    return

In [3]:
state = [[3], [], [1,2]]
printState(state)

     
    1
3   2
------


In [4]:
def stateMoveTuple(state, move):
    newState=copy.deepcopy(state)
    newMove=copy.deepcopy(move)
    for i in range (0,(len(newState))):
        newState[i]=tuple(newState[i])
    return (tuple(newState),tuple(newMove))

In [5]:
def validMoves(state):
    move = []
    for i in range (0,3):
        if len(state[i])==0:
            continue
        else: 
            for j in range (0,3):
                if j == i:
                    continue
                else:
                    if len(state[j])==0:
                        move.append([i+1,j+1])
                        continue
                    elif state[i][0] < state[j][0]:  # compare the top disks only
                        move.append([i+1,j+1])
    return move     

In [6]:
state=[[], [3, 2], [1]]
x=validMoves(state)
print(x)

[[2, 1], [3, 1], [3, 2]]


In [7]:
def makeMove(state,move):
    # Work on a copy so the caller's state is left unchanged
    tempState=copy.deepcopy(state)
    # Take the top disk off the source peg and place it on top of the destination peg
    x=tempState[move[0]-1].pop(0)
    tempState[move[1]-1].insert(0, x)
    return tempState

In [8]:
def foundGoal(board):
    return board == [[],[],[1,2,3]]

In [9]:
def epsilonGreedy(epsilon, Q, state):
    validMove = validMoves(state)
    if np.random.uniform() < epsilon:       
        return validMove[random.randint(0, len(validMove)-1)]
    else:
        # Greedy move: (state, move) pairs not yet in Q default to -1
        Qs = [Q.get(stateMoveTuple(state, m), -1) for m in validMove]
        return validMove[np.argmax(np.asarray(Qs))]

## Function Description For TrainQ Function

This function was developed by Prof. Charles Anderson for CSU CS 440 (Intro to AI) and was modified for this assignment. The original code was written for the Tic-Tac-Toe game; for this assignment we edited it to change the reinforcement. In this code, the following reinforcement is provided:

  - A reinforcement of 0 if the goal is found
  - A reinforcement of -1 otherwise

The move for every step is obtained from the epsilon-greedy function (given above), which with probability epsilon chooses a move at random from the list of valid moves and otherwise returns the move with the maximum Q value.
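
The update that `trainQ` applies to the previous state and move can be isolated in a few lines. This is a sketch under the assumption that the update has the form used in the code below, `Q[old] += rho * (-1 + Q[new] - Q[old])`; the helper name `td_update` and the string keys are hypothetical stand-ins for real (state, move) tuples.

```python
def td_update(Q, old_key, new_key, rho, reinforcement=-1):
    """Move Q[old_key] a fraction rho toward reinforcement + Q[new_key]."""
    Q[old_key] += rho * (reinforcement + Q[new_key] - Q[old_key])
    return Q[old_key]


# Toy keys standing in for (state, move) tuples along one path to the goal.
# New pairs start at -1; the move that reaches the goal is set to 0.
Q = {"two_before_goal": -1.0, "one_before_goal": -1.0, "goal_move": 0.0}

first = td_update(Q, "two_before_goal", "one_before_goal", rho=0.5)
second = td_update(Q, "two_before_goal", "one_before_goal", rho=0.5)
print(first, second)  # -1.5 -1.75
```

Repeated updates pull the value toward -2, which is the negated number of moves still needed after that move, so values closer to 0 mark moves closer to the goal.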

In [10]:
def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMoves, makeMove):
    maxGames = nRepetitions
    rho = learningRate
    epsilonDecayRate = epsilonDecayFactor
    epsilon = 1
    Q = {}   
    stepSize = []
    result=[]
    for nGames in range(maxGames):
        epsilon *= epsilonDecayRate
        step = 0
        state = [[1,2,3], [], []] #  start state
        done = False
        showMoves = False
        while not done:        
            step += 1
            move = epsilonGreedy(epsilon, Q, state)
            stateNew = makeMove(state,move) 
            if foundGoal(stateNew):
                # The move that reaches the goal has value 0
                Q[stateMoveTuple(state, move)] = 0
                done = True
            if stateMoveTuple(state, move) not in Q:
                Q[stateMoveTuple(state, move)] = -1  # initial Q value for new state,move
            if step > 1:
                # Update the previous state,move using reinforcement -1 and the current Q value
                Q[stateMoveTuple(stateOld, moveOld)] += rho * (-1+Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)])
            stateOld, moveOld = state, move 
            state = stateNew
        stepSize.append(step)
        #print("Game: {} took {} steps to find the solution.\n".format(nGames + 1, step))
        #result.append((nGames + 1, step))
    #aveSteps = 0
    #for game, steps in result:
        #aveSteps += steps
    #aveSteps = aveSteps / len(result)
    #print("The average number of steps for {} games was {}.\n".format(len(result), aveSteps))
    return Q, stepSize

In [11]:
trainQ(50, 0.5, 0.7, validMoves, makeMove)

({(((), (1,), (2, 3)), (2, 1)): -1.5,
  (((), (1,), (2, 3)), (2, 3)): 0,
  (((), (1, 2), (3,)), (2, 1)): -2.000000000008498,
  (((), (1, 2), (3,)), (2, 3)): -2.375,
  (((), (1, 2), (3,)), (3, 1)): -2.3125,
  (((), (1, 2, 3), ()), (2, 1)): -3.943359375,
  (((), (1, 2, 3), ()), (2, 3)): -4.25,
  (((), (1, 3), (2,)), (2, 1)): -4.3994140625,
  (((), (1, 3), (2,)), (2, 3)): -4.16796875,
  (((), (1, 3), (2,)), (3, 1)): -4.056640625,
  (((), (2,), (1, 3)), (2, 1)): -2.625,
  (((), (2,), (1, 3)), (3, 1)): -2.0,
  (((), (2,), (1, 3)), (3, 2)): -2.390625,
  (((), (2, 3), (1,)), (2, 1)): -4.12109375,
  (((), (2, 3), (1,)), (3, 1)): -3.796875,
  (((), (2, 3), (1,)), (3, 2)): -3.8515625,
  (((), (3,), (1, 2)), (2, 1)): -4.578125,
  (((), (3,), (1, 2)), (3, 1)): -4.16796875,
  (((), (3,), (1, 2)), (3, 2)): -4.2236328125,
  (((1,), (), (2, 3)), (1, 2)): -1.5,
  (((1,), (), (2, 3)), (1, 3)): 0,
  (((1,), (), (2, 3)), (3, 2)): -2.328125,
  (((1,), (2,), (3,)), (1, 2)): -2.0,
  (((1,), (2,), (3,)), (1, 

In [12]:
Q, stepsToGoal = trainQ(50, 0.5, 0.7, validMoves, makeMove)

In [13]:
stepsToGoal

[18,
 69,
 41,
 72,
 36,
 7,
 40,
 15,
 10,
 35,
 9,
 31,
 7,
 12,
 49,
 9,
 7,
 7,
 7,
 7,
 10,
 8,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 17,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7]

## Function Description of TestQ Function

This function returns the path of states without updating Q. The Q dictionary previously generated by the trainQ function is used to find the greedy action at each step until the goal is found. Since we used a reinforcement of -1 if the goal is not found and a reinforcement of 0 if the goal is found, larger Q values correspond to shorter paths to the goal, so we used the 'argmax' function: among the valid moves, the one with the highest Q value is selected. This is repeated either until the puzzle is solved or until the maximum number of steps is reached.
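
The greedy selection step can be sketched on its own as follows. To keep the example free of dependencies it uses Python's built-in `max` in place of `np.argmax`; both return the first of any tied moves. `greedy_move` is a hypothetical helper, and the Q values are copied from the trainQ output shown earlier for the state `[[1], [], [2, 3]]`.

```python
def greedy_move(Q, state_key, moves, default=0):
    """Return the valid move with the largest Q value.

    Pairs missing from Q are scored with `default`.
    """
    return max(moves, key=lambda m: Q.get((state_key, tuple(m)), default))


state_key = ((1,), (), (2, 3))
Q = {
    (state_key, (1, 2)): -1.5,
    (state_key, (1, 3)): 0,
    (state_key, (3, 2)): -2.328125,
}

best = greedy_move(Q, state_key, [[1, 2], [1, 3], [3, 2]])
print(best)  # [1, 3]
```

Because every learned value is at most 0, a default of 0 scores an unvisited pair at least as high as any learned one. If the greedy path wanders, a more negative default (such as the -1 used in `epsilonGreedy`) may be worth trying.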

In [14]:
def testQ(Q, maxSteps, validMovesF, makeMoveF):
    path=[]
    state=[[1,2,3],[],[]] #Start State
    path.append(state)
    done=False
    step=0
    
    while (not done and step<maxSteps)  :
        step+=1
        moves=validMovesF(state)
        # Evaluate Q for every valid move; the greedy choice is the largest value
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in moves]) 
        move = moves[np.argmax(Qs)]    
        newstate = makeMoveF(state, move)
        path.append(newstate)
        if foundGoal(newstate):
            done = True
        state = newstate
        
    return path

In [15]:
path = testQ(Q, 20, validMoves, makeMove)

In [16]:
path

[[[1, 2, 3], [], []],
 [[2, 3], [], [1]],
 [[3], [2], [1]],
 [[3], [1, 2], []],
 [[], [1, 2], [3]],
 [[1], [2], [3]],
 [[1], [], [2, 3]],
 [[], [1], [2, 3]],
 [[], [], [1, 2, 3]]]

In [17]:
for s in path:
    printState(s)
    print()

1    
2    
3    
------

     
2    
3   1
------

     
     
3 2 1
------

     
  1  
3 2  
------

     
  1  
  2 3
------

     
     
1 2 3
------

     
    2
1   3
------

     
    2
  1 3
------

    1
    2
    3
------



## Grading

Download and extract `A5grader.py` from [A5grader.tar](http://www.cs.colostate.edu/~anderson/cs440/notebooks/A5grader.tar).

In [18]:
%run -i A5grader.py


Testing validMoves([[1], [2], [3]])

--- 10/10 points. Correctly returned [[1, 2], [1, 3], [2, 3]]

Testing validMoves([[], [], [1, 2, 3]])

--- 10/10 points. Correctly returned [[3, 1], [3, 2]]

Testing makeMove([[], [], [1, 2, 3]], [3, 2])

--- 10/10 points. Correctly returned [[], [1], [2, 3]]

Testing makeMove([[2], [3], [1]], [1, 2])

--- 10/10 points. Correctly returned [[], [2, 3], [1]]

Testing   Q, steps = trainQ(1000, 0.5, 0.7, validMoves, makeMove).

--- 10/10 points. Q dictionary has correct number of entries.

--- 10/10 points. The mean of the number of steps is 8.43 which is correct.

Testing   path = testQ(Q, 20, validMoves, makeMove).

--- 20/20 points. Correctly returns path of length 8, less than 10.

Intro to AI Execution Grade is 80/80

 Remaining 20 points will be based on your text describing the trainQ and test! functions.

Intro to AI FINAL GRADE is __/100


## Extra Credit

### Towers of Hanoi with 4 Disks and 3 Pegs

Modify your code to solve the Towers of Hanoi puzzle with 4 disks instead of 3.  Name your functions

    - printState_4disk
    - validMoves_4disk
    - makeMove_4disk

Find values for number of repetitions, learning rate, and epsilon decay factor for which trainQ learns a Q function that testQ can use to find the shortest solution path.  Include the output from the successful calls to trainQ and testQ.
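
Whether a path is the shortest one can be checked against the known minimum of $2^n - 1$ moves for $n$ disks and three pegs (7 moves for 3 disks, 15 for 4). The sketch below assumes the path format returned by `testQ`, a list of states that includes the start state; `min_moves` and `is_shortest` are hypothetical helper names.

```python
def min_moves(n_disks):
    """Minimum number of moves for the three-peg puzzle with n_disks disks."""
    return 2 ** n_disks - 1


def is_shortest(path, n_disks):
    """True if path ends at the goal and uses the minimum number of moves.

    The path includes the start state, so it holds one more state than moves.
    """
    goal = [[], [], list(range(1, n_disks + 1))]
    return path[-1] == goal and len(path) - 1 == min_moves(n_disks)


# A two-disk example: three moves, four states.
two_disk_path = [
    [[1, 2], [], []],
    [[2], [1], []],
    [[], [1], [2]],
    [[], [], [1, 2]],
]
print(min_moves(3), min_moves(4))     # 7 15
print(is_shortest(two_disk_path, 2))  # True
```

For the 4-disk puzzle, a shortest path therefore contains 16 states.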

In [19]:
def printState_4disk(state):
    peg=[['','','',''],['','','',''],['','','','']]
    for i in range (0,(len(state))):
        if len(state[i])==0:
            peg[i][0]=' '
            peg[i][1]=' '
            peg[i][2]=' '
            peg[i][3]=' '
        if len(state[i])==1:
            peg[i][0]=' '
            peg[i][1]=' '
            peg[i][2]=' '
            peg[i][3]=state[i][0]
        if len(state[i])==2:
            peg[i][0]=' '
            peg[i][1]=' '
            peg[i][2]=state[i][0]
            peg[i][3]=state[i][1]
        if len(state[i])==3:
            peg[i][0]=' '
            peg[i][1]=state[i][0]
            peg[i][2]=state[i][1]
            peg[i][3]=state[i][2]
        if len(state[i])==4:
            peg[i][0]=state[i][0]
            peg[i][1]=state[i][1]
            peg[i][2]=state[i][2]
            peg[i][3]=state[i][3]
            
    for i in range (0,4):
        print (peg[0][i],peg[1][i],peg[2][i])
    print ('------')
    return

In [20]:
state=[[2,3],[1],[4]]
printState_4disk(state)

     
     
2    
3 1 4
------


In [21]:
def validMoves_4disk(state):
    move = []
    for i in range (0,3):
        if len(state[i])==0:
            continue
        else: 
            for j in range (0,3):
                if j == i:
                    continue
                else:
                    if len(state[j])==0:
                        move.append([i+1,j+1])
                        continue
                    elif state[i][0] < state[j][0]:  # compare the top disks only
                        move.append([i+1,j+1])
    return move     

In [22]:
state=[[2,1],[4],[3]]
validMoves_4disk(state)

[[1, 2], [1, 3], [3, 2]]

In [23]:
def makeMove_4disk(state,move):
    # Work on a copy so the caller's state is left unchanged
    tempState=copy.deepcopy(state)
    # Take the top disk off the source peg and place it on top of the destination peg
    x=tempState[move[0]-1].pop(0)
    tempState[move[1]-1].insert(0, x)
    return tempState

In [24]:
move=[1,2]
makeMove_4disk(state,move)

[[1], [2, 4], [3]]

#### TrainQ Function

In [25]:
def foundGoal_4disk(board):
    if board == [[],[],[1,2,3,4]]:
        return True
    else:
        return False
    
def trainQ_4disk(nRepetitions, learningRate, epsilonDecayFactor, validMoves, makeMove):
    maxGames = nRepetitions
    rho = learningRate
    epsilonDecayRate = epsilonDecayFactor
    epsilon = 1
    Q = {}   
    stepSize = []
    result=[]
    for nGames in range(maxGames):
        epsilon *= epsilonDecayRate
        step = 0
        state = [[1,2,3,4], [], []] #  start state
        done = False
        showMoves = False
        while not done:        
            step += 1
            move = epsilonGreedy(epsilon, Q, state)
            stateNew = makeMove(state,move) 
            if foundGoal_4disk(stateNew):
                # The move that reaches the goal has value 0
                Q[stateMoveTuple(state, move)] = 0
                done = True
            if stateMoveTuple(state, move) not in Q:
                Q[stateMoveTuple(state, move)] = -1  # initial Q value for new state,move
            if step > 1:
                # Update the previous state,move using reinforcement -1 and the current Q value
                Q[stateMoveTuple(stateOld, moveOld)] += rho * (-1+Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)])
            stateOld, moveOld = state, move 
            state = stateNew
        stepSize.append(step)
        #print("Game: {} took {} steps to find the solution.\n".format(nGames + 1, step))
        #result.append((nGames + 1, step))
    #aveSteps = 0
    #for game, steps in result:
        #aveSteps += steps
    #aveSteps = aveSteps / len(result)
    #print("The average number of steps for {} games was {}.\n".format(len(result), aveSteps))
    return Q, stepSize

### Result
#### Case 1
  - Number of Repetitions: 100
  - Learning Rate: 0.5
  - Epsilon Decay Factor: 0.7

In [26]:
Q, stepsToGoal = trainQ_4disk(100, 0.5, 0.7, validMoves_4disk, makeMove_4disk)

In [27]:
stepsToGoal

[390,
 145,
 284,
 93,
 360,
 168,
 57,
 109,
 178,
 25,
 83,
 176,
 57,
 204,
 32,
 50,
 66,
 38,
 85,
 27,
 128,
 34,
 36,
 20,
 44,
 119,
 37,
 85,
 23,
 47,
 17,
 34,
 16,
 119,
 31,
 20,
 56,
 18,
 33,
 15,
 36,
 22,
 15,
 20,
 19,
 20,
 59,
 16,
 15,
 15,
 15,
 66,
 15,
 15,
 17,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15]

### Result
#### Case 2
  - Number of Repetitions: 100
  - Learning Rate: 0.4
  - Epsilon Decay Factor: 0.8

In [28]:
Q, stepsToGoal = trainQ_4disk(100, 0.4, 0.8, validMoves_4disk, makeMove_4disk)

In [29]:
stepsToGoal

[890,
 123,
 143,
 344,
 128,
 281,
 198,
 178,
 38,
 52,
 66,
 79,
 46,
 132,
 46,
 42,
 43,
 82,
 30,
 58,
 134,
 33,
 30,
 54,
 34,
 31,
 58,
 92,
 27,
 28,
 89,
 28,
 111,
 33,
 18,
 23,
 110,
 18,
 57,
 22,
 17,
 37,
 20,
 25,
 85,
 16,
 39,
 16,
 63,
 30,
 19,
 15,
 18,
 36,
 56,
 19,
 15,
 23,
 15,
 18,
 15,
 17,
 85,
 15,
 31,
 15,
 17,
 15,
 17,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15]

#### TestQ

In [30]:
def testQ_4disk(Q, maxSteps, validMovesF, makeMoveF):
    path=[]
    state=[[1,2,3,4],[],[]] #Start State
    path.append(state)
    done=False
    step=0
    
    while (not done and step<maxSteps)  :
        step+=1
        moves=validMovesF(state)
        # Evaluate Q for every valid move; the greedy choice is the largest value
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in moves]) 
        move = moves[np.argmax(Qs)]    
        newstate = makeMoveF(state, move)
        path.append(newstate)
        if foundGoal_4disk(newstate):
            done = True
        state = newstate
        
    return path

In [31]:
path = testQ_4disk(Q, 20, validMoves_4disk, makeMove_4disk)
path

[[[1, 2, 3, 4], [], []],
 [[2, 3, 4], [1], []],
 [[3, 4], [1], [2]],
 [[3, 4], [], [1, 2]],
 [[4], [3], [1, 2]],
 [[1, 4], [3], [2]],
 [[1, 4], [2, 3], []],
 [[4], [1, 2, 3], []],
 [[], [1, 2, 3], [4]],
 [[], [2, 3], [1, 4]],
 [[2], [3], [1, 4]],
 [[1, 2], [3], [4]],
 [[1, 2], [], [3, 4]],
 [[2], [1], [3, 4]],
 [[], [1], [2, 3, 4]],
 [[], [], [1, 2, 3, 4]]]

#### Solution Path

In [32]:
for s in path:
    printState_4disk(s)
    print()

1    
2    
3    
4    
------

     
2    
3    
4 1  
------

     
     
3    
4 1 2
------

     
     
3   1
4   2
------

     
     
    1
4 3 2
------

     
     
1    
4 3 2
------

     
     
1 2  
4 3  
------

     
  1  
  2  
4 3  
------

     
  1  
  2  
  3 4
------

     
     
  2 1
  3 4
------

     
     
    1
2 3 4
------

     
     
1    
2 3 4
------

     
     
1   3
2   4
------

     
     
    3
2 1 4
------

     
    2
    3
  1 4
------

    1
    2
    3
    4
------



### Result
Two cases were presented as solutions to this problem. The values for the number of repetitions, learning rate, and epsilon decay factor for which trainQ learns a Q function that testQ can use to find the shortest solution path are given above, together with the output from the successful calls to the trainQ and testQ functions.