# The Past: AlphaGo
Previous versions of AlphaGo used:
* Two Convolutional Neural Networks
    * Policy Network to output list of candidate moves with weights
    * Value Network to output win probability
* Monte Carlo Tree Search rollouts to the end of the game
* CNNs trained on expert online games and self-play

# Currently: AlphaZero
The newest version, AlphaZero, doesn't use any human game data and trains entirely *tabula rasa*. A single deep neural network is used for both the policy and the value.

## Basic Model
$(\textbf{p}, v) = f_\theta(s)$


$\theta$ are the neural network parameters

$v$ is the predicted value: the expected game result $z$ from position $s$

$s$ is the board state (position)

$\textbf{p}$ is a vector of move probabilities, with one component $p_a$ for each action $a$.


$p_a = \Pr(a \mid s)$

$v = \mathbb{E}[z \mid s]$


Monte Carlo Tree Search (MCTS) then uses the network to guide its simulations. In each simulation, starting from state $s$, the search selects a move $a$ that has a low visit count, a high move probability, and a high value according to the current neural network $f_\theta$. The resulting values are averaged back up the tree. The MCTS returns a vector $\pi$, representing a probability distribution over moves.
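
The selection step can be sketched as below. The notes do not give the exact formula, so this assumes the PUCT-style rule commonly described for AlphaZero-like searches: score each move by its mean value $Q$ plus an exploration bonus $U$ that grows with the prior and shrinks with the visit count. The function name, the dictionary layout, and the constant `c_puct` are hypothetical.

```python
import math

def select_move(prior, visit_count, total_value, c_puct=1.5):
    """Pick the move maximizing Q + U among the children of one node.

    prior[a]:       move probability p_a from the network
    visit_count[a]: number of simulations that already took move a
    total_value[a]: sum of the values backed up through move a
    c_puct:         exploration constant (assumed value, not from the notes)
    """
    total_visits = sum(visit_count.values())
    best_move, best_score = None, -math.inf
    for move, p in prior.items():
        n = visit_count.get(move, 0)
        # Q: mean backed-up value; unvisited moves default to 0.
        q = total_value.get(move, 0.0) / n if n > 0 else 0.0
        # U: large for a high prior and a low visit count.
        # The +1 under the square root keeps U nonzero at a fresh node.
        u = c_puct * p * math.sqrt(total_visits + 1) / (1 + n)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move
```

At a fresh node the move with the highest prior is chosen; once a move has been visited often with poor results, both its $Q$ and its $U$ drop and the search shifts to other moves.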

## Training
Training is done with self-play reinforcement learning. Both players' moves are sampled from the MCTS output, $a_t \sim \pi_t$. When the game is over, the terminal position $s_T$ is scored as $z$: $-1$ for a loss, $0$ for a draw, $+1$ for a win. The neural network parameters $\theta$ are then updated to minimize the error between the predicted value $v_t$ and the actual result $z$, and to maximize the similarity between the policy vector $\textbf{p}_t$ and the search probabilities $\pi_t$. Concretely, the loss function is

$l = (z - v)^2 - \pi^\top \log \textbf{p} + c\|\theta\|^2$
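
The pieces of a training step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual training code: the function names are hypothetical, the default `c` is an assumed regularization weight, and the labeling assumes that $z$ is taken from the perspective of the side to move and that White moves first.

```python
import math
import random

def sample_move(pi, rng=random):
    """Draw a move a_t ~ pi_t from a dict mapping move -> probability."""
    moves = list(pi)
    weights = [pi[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]

def label_positions(num_positions, white_result):
    """Return z for each position, from the side to move's perspective.

    white_result is +1, 0, or -1 for White; position 0 has White to move.
    """
    return [white_result if t % 2 == 0 else -white_result
            for t in range(num_positions)]

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log(p) + c * ||theta||^2 for one position."""
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities pi and network policy p.
    # Terms with pi_a == 0 contribute nothing and are skipped to avoid log(0).
    policy_loss = -sum(pi_a * math.log(p_a)
                       for pi_a, p_a in zip(pi, p) if pi_a > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty
```

For example, `alphazero_loss(1, 0.5, [1.0, 0.0], [0.5, 0.5], [1.0, 2.0], c=0.1)` adds a value term of $0.25$, a policy term of $\log 2$, and a regularization term of $0.5$.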

## Architecture
AlphaZero uses a deep residual convolutional neural network. For chess, the input is the 8x8 board with each square having 12 bits to represent the piece type (6 piece types x 2 colors). Multiply this by a depth of 8 to store previous positions, so en passant and repetitions are implicitly represented in the network. Castling rights cannot always be recovered from the last 8 positions, so they need to be an additional input.
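
A minimal sketch of this input encoding, using nested lists rather than a tensor library. The board representation (a dict from `(rank, file)` to a piece letter), the plane ordering, and the function names are assumptions made for illustration.

```python
PIECES = "PNBRQKpnbrqk"  # 6 piece types x 2 colors; uppercase = White (assumed order)
HISTORY = 8              # number of past positions stacked together

def encode_position(board):
    """board maps (rank, file) -> piece letter; returns 12 planes of 8x8 bits."""
    planes = [[[0] * 8 for _ in range(8)] for _ in PIECES]
    for (rank, file), piece in board.items():
        planes[PIECES.index(piece)][rank][file] = 1
    return planes

def encode_history(boards):
    """Stack the last HISTORY positions, newest first.

    Early in a game there are fewer than HISTORY positions, so the
    remaining slots are filled with all-zero planes.
    """
    recent = list(reversed(boards[-HISTORY:]))
    stacked = []
    for board in recent:
        stacked.extend(encode_position(board))
    for _ in range(HISTORY - len(recent)):
        stacked.extend(encode_position({}))
    return stacked  # 12 * 8 = 96 planes, each 8x8
```

With these choices the network input has 96 planes of 8x8 bits, before any extra planes such as castling rights are added.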

The output is then an 8x8 board, where every origin square has 73 possible move types, giving $8 \times 8 \times 73 = 4{,}672$ possible moves. Illegal moves are then masked out and the remaining probabilities renormalized.
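
The masking and renormalization step might look like the sketch below. The flat index layout in `move_index`, the uniform fallback when every legal move has zero probability, and the function names are all assumptions; the code also assumes at least one legal move exists.

```python
MOVES_PER_SQUARE = 73
NUM_MOVES = 8 * 8 * MOVES_PER_SQUARE  # 4,672

def move_index(rank, file, move_type):
    """Flatten (origin square, move type) into one index (assumed layout)."""
    return (rank * 8 + file) * MOVES_PER_SQUARE + move_type

def mask_and_normalize(probs, legal):
    """Zero out illegal moves, then rescale the rest to sum to 1.

    probs: list of raw move probabilities from the network
    legal: list of booleans of the same length, True for legal moves
    """
    masked = [p if ok else 0.0 for p, ok in zip(probs, legal)]
    total = sum(masked)
    if total == 0.0:
        # Assumed fallback: spread probability uniformly over legal moves.
        n_legal = sum(1 for ok in legal if ok)
        return [1.0 / n_legal if ok else 0.0 for ok in legal]
    return [p / total for p in masked]
```

For instance, masking the middle entry of `[0.5, 0.3, 0.2]` leaves `0.5` and `0.2`, which are rescaled to `5/7` and `2/7`.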

This requires massive hardware to train: 5,000 first-generation TPUs for self-play and 64 second-generation TPUs to train the network. Once trained, it runs on 4 TPUs.


# Our Version
Build something similar, but focus on being useful to humans, specifically through post-game analysis or mimicking a certain player's style. 


## The Plan
Build a similar policy/value neural net, but train it on amateur human games, using Elo ratings as an additional input. Then, you can give move recommendations based on the player's best move, not the engine's best move.

That is, could the policy component roll out the MCTS according to the play that the player's Elo would suggest, and then compare the values of different moves on that basis?

Furthermore, a policy network could be retrained on two players' games to suggest moves in hypothetical positions against a specific player.
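
One possible way to feed the Elo rating from the plan above into a convolutional network is as an extra constant input plane. This is only a sketch of the idea: the function name and the rating range used for scaling are hypothetical choices, not something decided in these notes.

```python
def elo_plane(elo, lo=600.0, hi=2800.0):
    """Return an 8x8 plane filled with the Elo rating scaled into [0, 1].

    lo and hi are assumed bounds on the ratings seen in training;
    ratings outside that range are clamped.
    """
    scaled = min(1.0, max(0.0, (elo - lo) / (hi - lo)))
    return [[scaled] * 8 for _ in range(8)]
```

One such plane per player could be appended to the board planes, so the same network can be queried at different playing strengths by changing only this input.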