In [None]:
using Pkg
Pkg.add("InteractBase")
Pkg.add("CSSUtil")
Pkg.add("WebIO")
Pkg.add("JSExpr")
Pkg.add("Observables")
Pkg.add("AlphaGo")
Pkg.add("Flux")
Pkg.add("BSON")  # needed below for `using BSON: @load`

# AlphaGo Zero - adversarial search with guesses from a neural network

<img src=go.jpg>

In Lecture 9, we learned about multi-agent systems and adversarial games. A central algorithm in game playing is **MiniMax**.

MiniMax essentially searches the entire space of moves available to both players, all the way until the game ends. This is how it maximizes lifetime reward while assuming the opponent is trying to minimize it. We saw an optimization for MiniMax called alpha-beta pruning. It is a great improvement, but even in the best case it only lets the search go about twice as deep in the same time. This might help in games such as chess (an AI agent can look about twice as far ahead on the same computer, assuming it uses a plausible heuristic).
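
To make the pruning idea concrete, here is a minimal sketch of MiniMax with and without alpha-beta pruning on a tiny hand-written game tree. The tree, the nested-vector representation, and the function names are illustrative assumptions, not code from the lecture or from AlphaGo.jl.

```julia
# A game tree as nested vectors: a leaf is a number (payoff for the maximizing
# player), an internal node is a vector of child nodes.
# `visited` counts evaluated leaves so the effect of pruning is visible.

function minimax(node, maximizing::Bool, visited::Ref{Int})
    if node isa Number
        visited[] += 1
        return node
    end
    values = [minimax(child, !maximizing, visited) for child in node]
    return maximizing ? maximum(values) : minimum(values)
end

function alphabeta(node, maximizing::Bool, visited::Ref{Int}; α = -Inf, β = Inf)
    if node isa Number
        visited[] += 1
        return node
    end
    best = maximizing ? -Inf : Inf
    for child in node
        v = alphabeta(child, !maximizing, visited; α = α, β = β)
        if maximizing
            best = max(best, v)
            α = max(α, best)
        else
            best = min(best, v)
            β = min(β, best)
        end
        α ≥ β && break   # the other player would never allow this line: prune
    end
    return best
end

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]   # small illustrative tree

n_full, n_pruned = Ref(0), Ref(0)
v_full = minimax(tree, true, n_full)
v_pruned = alphabeta(tree, true, n_pruned)
println("minimax value    = ", v_full, " after ", n_full[], " leaves")
println("alpha-beta value = ", v_pruned, " after ", n_pruned[], " leaves")
```

Both searches return the same value; on this tree alpha-beta evaluates 7 of the 9 leaves. How much is pruned depends heavily on the order in which children are visited.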

TicTacToe has at most $9! = 362880$ possible move sequences, and fewer in practice because many games end before the board is full. A game tree of this size can be fully searched by computers.
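
As a rough check that this tree really is small, the following sketch searches tic-tac-toe exhaustively, counting every complete game and computing the MiniMax value of the empty board. The board encoding and the names `has_won` and `search!` are assumptions made for this example only.

```julia
# Board: 9-element vector, 0 = empty, 1 = X, 2 = O.
has_won(board, p) = any(all(board[i] == p for i in line) for line in
    ((1,2,3), (4,5,6), (7,8,9), (1,4,7), (2,5,8), (3,6,9), (1,5,9), (3,5,7)))

# Visit every complete game reachable from `board` (counted in `counter`) and
# return the minimax value from X's point of view (+1 win, 0 draw, -1 loss).
function search!(board, player, counter::Ref{Int})
    opponent = 3 - player
    if has_won(board, opponent)          # the previous move ended the game
        counter[] += 1
        return opponent == 1 ? 1 : -1
    elseif all(!=(0), board)             # full board, nobody won: draw
        counter[] += 1
        return 0
    end
    best = player == 1 ? -2 : 2
    for i in 1:9
        board[i] == 0 || continue
        board[i] = player                # try the move ...
        v = search!(board, opponent, counter)
        board[i] = 0                     # ... and undo it
        best = player == 1 ? max(best, v) : min(best, v)
    end
    return best
end

games = Ref(0)
game_value = search!(zeros(Int, 9), 1, games)
println("complete games: ", games[], ", value under perfect play: ", game_value)
```

The count comes out below $9!$ because games stop as soon as one side completes a line, and the value of 0 reflects that perfect play ends in a draw.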

Some games such as Go are significantly more complicated: by one estimate there are:

```
2081681993819799846994786333448627702865224
5388453054842563945682092741961273801537852
5648451698519643907259916015628128546089888
314427129715319317557736620397247064840935
```

states, roughly $2 \times 10^{170}$! The dimensions of the search tree are enormous. The branching factor of Go is about 250! By a commonly quoted estimate, Go is about $10^{100}$ times more complex than chess. It has been a formidable challenge for AI.
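
A number of this size is easier to appreciate with a little arithmetic. The sketch below parses the figure quoted above with Julia's built-in `BigInt` and compares it with $3^{361}$, the number of ways to mark each of the 361 points of a 19x19 board as empty, black, or white. That comparison is an addition made here for illustration; the variable names are arbitrary.

```julia
# The estimate quoted above, joined into one string of decimal digits.
digits_str = "2081681993819799846994786333448627702865224" *
             "5388453054842563945682092741961273801537852" *
             "5648451698519643907259916015628128546089888" *
             "314427129715319317557736620397247064840935"
n_states = parse(BigInt, digits_str)

# Every point of the 19x19 board is empty, black, or white.
n_colourings = big(3)^361
legal_fraction = Float64(n_states / n_colourings)

println("decimal digits:    ", ndigits(n_states))
println("as a float:        ", Float64(n_states))
println("share of 3^361:    ", legal_fraction)
```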

In March 2016 DeepMind's AlphaGo Lee became the first program to beat a top professional at Go -- 18-time world champion Lee Sedol. (Deep Blue beat Garry Kasparov in chess in 1997 -- superhuman ability in Go took 19 years longer).

The key insight of this software was to use a neural network to learn the game, at first by looking at how humans play (using games from amateur players on internet Go servers), and later by playing the program against itself.

On 19 Oct 2017 DeepMind unveiled a newer, more elegant version of AlphaGo -- AlphaGo Zero. This version learned the game entirely on its own by playing millions of games against its best self. In 3 days, AlphaGo Zero trained enough to beat the earlier version (AlphaGo Lee) 100 games to 0!

## AlphaGo Zero's algorithm

### Why neural networks?

**Q-learning** with neural networks:

- **Goal-orientation:** The agent's goal is to figure out the best move in any given state.
- Recall the $Q$ function we used for value iteration (it's the expected future reward). $Q : S \times A \rightarrow \mathbb{R}$
- Usually we represent $Q$ as a table: we actually store a value for every action at every state. But as you can tell, in the case of Go this table is going to be too big!
- **One idea is to approximate $Q$ with a function that requires less storage.** A function such as a neural network.
  - Start off with a random neural network
  - Keep playing the game to improve the $Q$ function it approximates
  
This is not exactly what AlphaGo Zero uses, but it can be used to play a lot of games, including Breakout. AlphaGo Zero marries this kind of learned function approximation with Monte Carlo tree search.
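
For contrast with the neural-network approach, here is a minimal sketch of the tabular representation of $Q$ described in the list above, on a made-up five-state chain world. The environment, the function name `learn_q`, and the simplification of sweeping over all state-action pairs with learning rate 1 (rather than sampling moves by playing) are assumptions for illustration.

```julia
# Toy chain world (hypothetical): states 1..n_states, the last one is terminal.
# Actions: 1 = step left, 2 = step right. Reaching the last state pays reward 1.
function learn_q(; n_states = 5, discount = 0.9, sweeps = 20)
    next_state(s, a) = clamp(s + (a == 1 ? -1 : 1), 1, n_states)
    Q = zeros(n_states, 2)          # the table: one stored number per (state, action)
    for _ in 1:sweeps, s in 1:n_states-1, a in 1:2
        s2 = next_state(s, a)
        r = s2 == n_states ? 1.0 : 0.0
        future = s2 == n_states ? 0.0 : maximum(Q[s2, :])
        # Q-learning update with learning rate 1 in a deterministic world:
        # Q(s, a) <- r + discount * max over a' of Q(s', a')
        Q[s, a] = r + discount * future
    end
    return Q
end

Q = learn_q()
greedy = [argmax(Q[s, :]) for s in 1:size(Q, 1) - 1]
println("Q table: ", Q)
println("greedy action in each non-terminal state: ", greedy)
```

The table here has only 10 entries. A table indexed by Go positions would need one row per board state, which is why a compact approximator such as a neural network is used instead.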

### Search strategy - Monte Carlo tree search (MCTS)


![](https://nikcheerla.github.io/deeplearningschool//media/branching.jpg)

One way programs can search the state space graph is to **pick a random sample of branches**, playing each one many times to see which ones have a better probability of winning further down the tree. This is called Monte Carlo tree search.
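
The sampling idea can be sketched on tic-tac-toe. Note that this is a simplified "flat" Monte Carlo search, which scores each candidate move by uniformly random playouts and does not grow a tree or reuse statistics the way full MCTS does. The board encoding and all function names are assumptions for this example.

```julia
using Random

# Board: 9-element vector, 0 = empty, 1 = X, 2 = O.
winner_is(board, p) = any(all(board[i] == p for i in line) for line in
    ((1,2,3), (4,5,6), (7,8,9), (1,4,7), (2,5,8), (3,6,9), (1,5,9), (3,5,7)))

# Play uniformly random moves until the game ends; return the winner (0 = draw).
function random_playout(board, player, rng)
    board = copy(board)
    while true
        free = findall(==(0), board)
        isempty(free) && return 0
        board[rand(rng, free)] = player
        winner_is(board, player) && return player
        player = 3 - player
    end
end

# For each legal move of `player`, estimate the fraction of random playouts won.
function monte_carlo_move(board, player; playouts = 200, rng = MersenneTwister(1))
    moves = findall(==(0), board)
    rates = Float64[]
    for move in moves
        trial = copy(board)
        trial[move] = player
        if winner_is(trial, player)
            push!(rates, 1.0)                     # the move wins on the spot
        else
            wins = count(random_playout(trial, 3 - player, rng) == player
                         for _ in 1:playouts)
            push!(rates, wins / playouts)
        end
    end
    return moves[argmax(rates)], Dict(zip(moves, rates))
end

board = [1, 1, 0,
         2, 2, 0,
         0, 0, 0]
best, rates = monte_carlo_move(board, 1)
println("estimated win rates: ", rates)
println("chosen move: ", best)
```

In this position X can complete the top row, so cell 3 is scored as a certain win and chosen. The other estimates vary with the random seed and the number of playouts.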

**Approximating a function that guides MCTS**

AlphaGo Zero uses a neural network $f_{\theta}$ which approximates a probability distribution over all branches for the next move. The sampling algorithm then draws from this probability distribution. The same neural network is also used to estimate the probability of winning, $v$, from the state it is applied to.


$$(\textbf{p}, v) = f_{\theta}(s)$$

Here $\textbf{p}$ maps every action available at state $s$ to the probability with which it should be sampled.
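
The shape of $f_{\theta}$ can be sketched in plain Julia: a shared layer followed by two heads, one producing $\textbf{p}$ and one producing $v$. This is an untrained toy with random weights for a 3x3 board, not the architecture used by AlphaGo Zero or AlphaGo.jl. The layer sizes, the names `tiny_net` and `f_theta`, and the choice to mask occupied cells inside the softmax are all assumptions.

```julia
using Random

# Hypothetical random weights: a shared hidden layer plus a policy and a value head.
tiny_net(rng; hidden = 16) = (W_hidden = 0.1 .* randn(rng, hidden, 9),
                              W_policy = 0.1 .* randn(rng, 9, hidden),
                              w_value  = 0.1 .* randn(rng, hidden))

# Softmax restricted to legal moves: illegal moves get probability exactly 0.
function softmax_masked(logits, legal)
    z = exp.(logits .- maximum(logits)) .* legal
    return z ./ sum(z)
end

# s: 9-element board, +1 = current player's marks, -1 = opponent's, 0 = empty.
function f_theta(net, s)
    h = tanh.(net.W_hidden * s)                     # shared representation
    p = softmax_masked(net.W_policy * h, s .== 0)   # move probabilities
    v = tanh(sum(net.w_value .* h))                 # value estimate in [-1, 1]
    return p, v
end

net = tiny_net(MersenneTwister(0))
s = [1, -1, 0, 0, 1, 0, 0, 0, -1]
p, v = f_theta(net, s)
println("p = ", round.(p, digits = 3))
println("v = ", v)
```

A search procedure would sample or rank moves using `p` and use `v` in place of playing the game out to the end.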



### Training

A neural network needs to be told which moves are good and which are not in order to train itself to be better at the game. AlphaGo Zero keeps improving by playing against itself. Winning the game gets a reward of 1, losing gets a reward of -1.

This is the simple algorithm that runs all of AlphaGo Zero!
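
In the Nature paper listed in the references, the network is trained to minimize $l = (z - v)^2 - \boldsymbol{\pi}^\top \log \textbf{p} + c \lVert \theta \rVert^2$, where $z$ is the final game outcome and $\boldsymbol{\pi}$ is the move distribution produced by the tree search. The function below is a small sketch of that formula only. The name `agz_loss`, the argument layout, and the default regularization constant are assumptions, and it is not how AlphaGo.jl organizes its training code.

```julia
# p: move probabilities predicted by the network
# v: value predicted by the network
# pi_target: move probabilities derived from the tree search (training target)
# z: outcome of the self-play game from the current player's view (+1 or -1)
# θ: flat vector of network weights, used only for the L2 penalty
function agz_loss(p, v, pi_target, z; θ = Float64[], c = 1e-4)
    value_loss  = (z - v)^2
    # Cross-entropy between search probabilities and network probabilities;
    # terms with zero target probability contribute nothing and are skipped.
    policy_loss = -sum(pi_target[i] * log(p[i])
                       for i in eachindex(p) if pi_target[i] > 0)
    return value_loss + policy_loss + c * sum(abs2, θ)
end

# A confident, correct prediction has zero loss ...
println(agz_loss([0.0, 1.0, 0.0], 1.0, [0.0, 1.0, 0.0], 1.0))
# ... while an uninformed prediction is penalized on both heads.
println(agz_loss([0.5, 0.5], 0.0, [0.5, 0.5], 1.0))
```

The value term pulls $v$ toward the actual result of the game, and the policy term pulls $\textbf{p}$ toward the moves the search ended up preferring.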

In [10]:
using WebIO, InteractBase, JSExpr, CSSUtil, Observables

In [11]:
const btn_style = Dict("width"=> 36px, "height"=>36px, "borderRadius" => 4px, "margin"=>2px)
function board()
    scope = Scope()
    click = scope["clicks"] = Observable{Any}(nothing)
    scope["notif"] = Observable("")
    for i=1:3, j=1:3
        scope["cell-$i-$j"] = Observable(" ")
    end
    btns = [button(label=scope["cell-$i-$j"], style=btn_style,
        events=Dict("click" => @js () -> $click[] = $((i,j))))
              for i=1:3, j=1:3]
    grid = hbox(mapslices((x...)->vbox(x...), reshape(btns, (3,3)), dims=1)...)
    scope.dom = hbox(grid, hskip(1em), scope["notif"])(alignitems("center"))
    scope
end

function makechan(ob)
    c = Channel{Any}(0)
    on(ob) do x
        @async put!(c, x)
    end
    c
end



makechan (generic function with 1 method)

In [12]:
b = board()

In [9]:
using AlphaGo, Flux
using BSON: @load

function load_model(str, n, env::AlphaGo.GameEnv)
  @load str*"/agz_base-$n.bson" bn
  @load str*"/agz_value-$n.bson" value
  @load str*"/agz_policy-$n.bson" policy

  @load str*"/weights/agz_base-$n.bson" bn_weights
  @load str*"/weights/agz_value-$n.bson" val_weights
  @load str*"/weights/agz_policy-$n.bson" pol_weights

  Flux.loadparams!(bn, bn_weights)
  Flux.loadparams!(value, val_weights)
  Flux.loadparams!(policy, pol_weights)

  NeuralNet(env; base_net=bn, value=value, policy=policy)
end



load_model (generic function with 1 method)

In [10]:
import AlphaGo: GameEnv, MCTSPlayer, initialize_game!, N, tree_search!, is_done

function play(env::AlphaGo.GameEnv, nn, human_moves, accepted_moves,
              notif; tower_height = 6, num_readouts = 800, mode = 0
  ) #mode=0, human starts with black, else starts with white

  @assert 0 ≤ tower_height ≤ 19

  az = MCTSPlayer(env, nn, num_readouts = num_readouts, two_player_mode = true)

  initialize_game!(az)
  num_moves = 0

  mode = mode == 0 ? mode : 1
  while !is_done(az)
    if num_moves % 2 == mode
      notif[] = "Your turn"
      mv = take!(human_moves)
      move = Tuple(mv)
    else
      notif[] = "AlphaZero's turn"
      current_readouts = N(az.root)
      readouts = az.num_readouts

      while N(az.root) < current_readouts + readouts
        tree_search!(az)
      end

      move = pick_move(az)
      #println(to_kgs(move, az.env))
    end
    if play_move!(az, move)
      accepted_moves[] = (move, num_moves % 2)
      num_moves += 1
    end
  end

  #println(az.root.position)

  winner = result(az.root.position)
  set_result!(az, winner, false)
  mode = mode == 0 ? -1 : 1
  notif[] = string("Result ", az.result_string)
end


play (generic function with 1 method)

In [11]:
# Play with a specific neural network
function play_with(gametype, nn::NeuralNet)
    b = board()

    moves = Observable{Any}(nothing)
    on(moves) do (mv, mode,)
        i, j = mv
        b["cell-$i-$j"][] = string(mode)
    end

    t= @async play(gametype, nn, makechan(b["clicks"]), moves, b["notif"])

    b,t
end


# play with the dumbest network first:
gametype = AlphaGo.GomokuEnv(3,3) # the specific case which is tictactoe
nn = load_model("models", "1", gametype) # there are 50 levels we have trained, second argument is the level

b,t = play_with(gametype, nn)
b

In [12]:
@manipulate for level = slider(1:50, label="Experience level -- (1 level=200 games): ", value=1)
    gametype = AlphaGo.GomokuEnv(3,3) # the specific case which is tictactoe
    nn = load_model("models", string(level), gametype) # there are 50 levels we have trained, second argument is the level

    play_with(gametype, nn)[1]
end
    


## References

- **Mastering the game of Go without human knowledge**  -- Nature (Oct. 2017)
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel & Demis Hassabis

- **AlphaGo Zero Explained - On AI** blog post https://nikcheerla.github.io/deeplearningschool/2018/01/01/AlphaZero-Explained/

- **AlphaGo.jl** - Julia implementation of the AlphaGo Zero algorithm for Go and Gomoku -- a generalization of tic-tac-toe