Minimax Q-Learning with Deep Neural Network for Connect-4
Prior to implementing Connect-4, I tried to approximate a Q-value function with a neural network that takes the current board state as input and outputs the corresponding action values. However, in both Tetris and 2048, the model became overly complex because the next state is probabilistic rather than perfectly determined by the action. I had also previously had notable success with Tic-Tac-Toe, so I decided to model a relatively simple system: Connect-4, in which the number of possible actions is constrained and the next state is perfectly known.
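For concreteness, here is a minimal sketch of the kind of network this refers to, assuming the 6×7 board is flattened into 42 inputs and the network outputs one Q-value per column. PyTorch, the -1/0/+1 cell encoding, and all layer sizes here are my own illustrative choices, not necessarily the project's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative Q-network for Connect-4 (not the repo's actual model).
# Input: flattened 6x7 board (42 cells, encoded e.g. as -1/0/+1).
# Output: one Q-value per column (7 possible actions).
q_net = nn.Sequential(
    nn.Linear(6 * 7, 128),   # hidden sizes are arbitrary choices
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 7),       # Q(s, a) for each of the 7 columns
)

board = torch.zeros(1, 42)   # empty board
q_values = q_net(board)      # shape (1, 7)
```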
Implement Game Logic
- Placing Step
- Evaluating Rewards
- Computing Terminal State (see the sketch after this list)
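To make these three pieces concrete, here is a minimal sketch assuming a 6×7 numpy board with 0 for empty cells and ±1 for the two players. The function names and the simple ±1 terminal reward are hypothetical stand-ins, not the repo's actual code.

```python
import numpy as np

ROWS, COLS = 6, 7

def place(board, col, player):
    """Drop a piece for `player` (+1 or -1) into `col`; return the landing row."""
    for row in range(ROWS - 1, -1, -1):      # search from the bottom row up
        if board[row, col] == 0:
            board[row, col] = player
            return row
    raise ValueError("column is full")

def is_win(board, player):
    """Terminal check: four `player` pieces in a row along any direction."""
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        for r in range(ROWS):
            for c in range(COLS):
                if all(
                    0 <= r + i * dr < ROWS
                    and 0 <= c + i * dc < COLS
                    and board[r + i * dr, c + i * dc] == player
                    for i in range(4)
                ):
                    return True
    return False

def reward(board, player):
    """Simple terminal reward: +1 win, -1 loss, 0 otherwise."""
    if is_win(board, player):
        return 1
    if is_win(board, -player):
        return -1
    return 0
```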
Implement Agent Logic
- SARSA or pure Q-Learning?
- Pure Q-Learning
- Experience Replay (see the agent sketch after this list)
- e-greedy agent
- simple board evaluation heuristic
- Validation with Tables in place of Neural Networks
- Save/Load Agent
- Generalizing Agent to learn for any game, given formatted input and output
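Here is a compact sketch of the e-greedy policy, the experience-replay buffer, and the pure Q-learning target from the list above, assuming a PyTorch `q_net` like the one sketched earlier. All hyperparameters, names, and the per-sample update loop are illustrative assumptions rather than the project's actual implementation.

```python
import random
from collections import deque

import torch

GAMMA = 0.9          # discount factor (illustrative value)
EPSILON = 0.1        # exploration rate for the e-greedy policy

replay_buffer = deque(maxlen=10_000)   # experience replay memory

def epsilon_greedy(q_net, state, legal_actions):
    """Pick a random legal column with prob. EPSILON, else the greedy one."""
    if random.random() < EPSILON:
        return random.choice(legal_actions)
    q = q_net(state).squeeze(0)
    # restricting the argmax to legal_actions masks out full columns
    return max(legal_actions, key=lambda a: q[a].item())

def q_learning_step(q_net, optimizer, batch_size=32):
    """Sample decorrelated transitions and regress toward the Q-learning target."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    for state, action, r, next_state, done in batch:
        with torch.no_grad():
            # pure (off-policy) Q-learning: bootstrap from the max next-state value
            target = r if done else r + GAMMA * q_net(next_state).max().item()
        pred = q_net(state).squeeze(0)[action]
        loss = (pred - torch.tensor(target, dtype=torch.float32)) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Sampling random minibatches from the buffer, rather than training on consecutive moves, is what breaks the temporal correlation between training sequences; swapping the `max` target for the value of the action actually taken next would turn this into SARSA.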
Implement Testing Logic
The agent should be exposed either to user input or to some other form of AI to train against; currently, it is essentially playing against itself.
- Compete Against User
- Implement Heuristic MiniMax Agent (see the sketch after this list)
- Handle SIGINT to save midway results
- Tracking the loss function
- Parameters:
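For the heuristic MiniMax opponent mentioned above, here is a minimal depth-limited sketch in negamax form, reusing the hypothetical `place`/`is_win` helpers from the game-logic sketch. The cutoff simply scores 0, where a real board-evaluation heuristic would plug in; depth and scoring are arbitrary choices, not the repo's.

```python
def legal_moves(board):
    """Columns whose top cell is still empty."""
    return [c for c in range(COLS) if board[0, c] == 0]

def minimax(board, player, depth):
    """Depth-limited minimax; returns (score, best column) from `player`'s view."""
    if is_win(board, player):
        return 1, None
    if is_win(board, -player):
        return -1, None
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return 0, None           # draw / cutoff: a real heuristic would score here
    best_score, best_move = -float("inf"), moves[0]
    for col in moves:
        row = place(board, col, player)
        score, _ = minimax(board, -player, depth - 1)
        board[row, col] = 0      # undo the move
        if -score > best_score:  # opponent's score is negated (negamax form)
            best_score, best_move = -score, col
    return best_score, best_move
```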
By taking advantage of greater memory and using Experience Replay to break the temporal correlation between training sequences, I was able to achieve much more impressive learning: an AI that actually outperforms the minimax AI.
I was also able to replicate the result several times.
1 corresponds to minimax agent winning; -1 corresponds to Q-net agent winning. It is quite apparent that the minimax agent performs much better than the Q-net agent.
1 corresponds to random agent winning; -1 corresponds to Q-net agent winning. The Q-net agent performs slightly better than the random agent, but not by an impressive amount.
I also attempted (knowing that the chances of it working were slim) to implement a supervised Q-learning agent, which, as expected, did not work at all.