This notebook collects notes and thoughts, with the most recent chapters at the beginning and older thoughts further down.

# After A2C

A2C can now be seen live at work in [This Notebook](A2C_Curriculum.ipynb). It failed to produce a good policy. Indeed, it produces uniform distributions even from initially well-performing policies (from imitation learning) - which is the worst outcome.

### Potential issues
#### Frustration
Incoming gradients from the losing side may increase the entropy such that the policy essentially creates a uniform distribution. 

Frustration may arise from the *critic* not being consistent yet. It's definitely worth looking at the quality of the *critic*'s advantages. On any single trajectory, the perfect critic would match the alternating Bellman scheme.

Frustration will most likely arise from the alternating scheme in general, as negative and positive signals are evenly distributed. 

It would be interesting to report the entropy as a metric.
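A minimal sketch of such an entropy metric, assuming the policy outputs a probability vector over the 19x19 = 361 board positions. The function names are my own and not taken from the A2C notebook. Normalizing by `log(n_actions)` makes the value 1.0 for a uniform distribution and close to 0.0 for a deterministic one, so a collapse towards uniform shows up as the metric approaching 1.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) for each row of a batch of action distributions."""
    probs = np.asarray(probs, dtype=np.float64)
    # eps guards against log(0) for moves with zero probability
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def normalized_entropy(probs):
    """Entropy divided by its maximum log(n_actions): 1.0 means uniform."""
    n_actions = np.asarray(probs).shape[-1]
    return policy_entropy(probs) / np.log(n_actions)

# Two extreme cases on a 19x19 board (361 actions)
uniform = np.full((1, 361), 1.0 / 361)
peaked = np.zeros((1, 361))
peaked[0, 0] = 1.0
```

Logging the batch mean of `normalized_entropy` once per training step would be enough to see whether the policy drifts towards uniform.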

One quick win could come from switching off the losing signal entirely. Even then, however, the value signal will be noisy enough to provide almost arbitrary feedback for the winner's moves as well.
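Switching off the losing signal could look like the following sketch. It assumes the usual policy-gradient loss in which each move's log-probability is weighted by its advantage, so a zero advantage contributes no policy gradient. The function name and the `movers`/`winner` encoding are hypothetical.

```python
import numpy as np

def mask_loser_advantages(advantages, movers, winner):
    """Zero the advantages of all moves made by the losing side.

    advantages: one advantage estimate per move of the trajectory
    movers:     id of the player who made each move (e.g. 0 or 1)
    winner:     id of the player who won the game
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    movers = np.asarray(movers)
    return np.where(movers == winner, advantages, 0.0)

# Example trajectory with alternating movers, player 0 wins
masked = mask_loser_advantages([0.5, -0.3, 0.2, -0.1], [0, 1, 0, 1], winner=0)
```

Note that this only removes the loser's policy signal. An entropy bonus or the critic loss, if present, would still act on those moves.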

#### Off-by-one
We still have that mysterious off-by-one problem in function ```analyse_and_recommend```.
We need to understand its cause to rule out that, despite the hack, the function still provides a wrong recommendation.

### Preliminary conclusion
I now believe that A2C can't be successful with such a hard problem. Not even a little performance increase from imitation-learned trajectories is possible. This should definitely provide a warning in view of my CFDS project: Keep it simple!

### Next steps 
It would really be interesting to see the setup succeed on a less demanding problem - just to see it working. I have a Q-learning tutorial in [this notebook](RL_QLearning.ipynb) that might provide a solvable toy problem. 
Then it is certainly advisable to implement the OpenAI Gym interfaces.

The Gomoku problem can still be solved. With the policy network and the RL algorithm in place, an actor-learner can be implemented with manageable effort. An actor-learner scheme has a great chance to succeed.

### Thoughts on policy-advised UCT
The currently best policy and value network snapshots are 
```
policy_model = PolicyModel(board_size=19, n_blocks=10, 
                    n_layers=3, n_filters=32, 
                    activation='relu')
policy_model.load_weights("./models/PolicyNet_1.0/cp-0003.ckpt")

value_model = ValueModel(19, 10, 3, 32, 'relu')
value_model.load_weights("./models/ValueNet_3.0/cp-0001.ckpt")
```
These networks have been trained on the P5K dataset of 4600 heuristic gameplays until convergence. See [this notebook](A2C_PolicyNetwork.ipynb) and [this notebook](A2C_ValueNetwork.ipynb) for details.

The policy may advise the tree search by providing the initial distribution (thus, it's no longer MCTS). The value function will provide feedback instead of the rollout, or allow the rollout to stop before a final state is reached. 

To boost performance, we should use the Ray framework (I started with [this notebook](RayTutorial.ipynb), but didn't really get anywhere yet). 

The most compelling idea at the moment:
- have a master actor to start parallel tree searches
- the search actors submit policy eval requests to a dispatcher
- the dispatcher collects requests until an NN actor becomes available,
- the NN actor processes the requests in batch and returns the result off-line
- the dispatcher actor dispatches the results to its clients
- maybe the NN actors could address different GPUs?
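
The batching part of this idea can be sketched in plain Python. This is a synchronous stand-in, not a Ray implementation: no Ray calls are used, and `BatchingDispatcher`, `submit`, `flush` and `dummy_evaluate` are hypothetical names. It assumes each search actor has at most one outstanding request, which is why results are keyed by client id.

```python
import numpy as np

class BatchingDispatcher:
    """Collects evaluation requests and forwards them to the network in batches."""

    def __init__(self, evaluate_batch, max_batch=8):
        self.evaluate_batch = evaluate_batch  # callable: stacked boards -> outputs
        self.max_batch = max_batch
        self.pending = []                     # list of (client_id, board)

    def submit(self, client_id, board):
        """Called by a search actor that needs a policy evaluation."""
        self.pending.append((client_id, board))

    def flush(self):
        """Evaluate all pending requests in batches; return {client_id: output}."""
        results = {}
        while self.pending:
            chunk = self.pending[:self.max_batch]
            self.pending = self.pending[self.max_batch:]
            ids = [client_id for client_id, _ in chunk]
            boards = np.stack([board for _, board in chunk])
            outputs = self.evaluate_batch(boards)  # one network call per batch
            results.update(zip(ids, outputs))
        return results

# Stand-in for the NN actor: records batch sizes, returns one number per board
calls = []
def dummy_evaluate(boards):
    calls.append(len(boards))
    return boards.reshape(len(boards), -1).sum(axis=1)

dispatcher = BatchingDispatcher(dummy_evaluate, max_batch=2)
for client_id in range(5):
    dispatcher.submit(client_id, np.full((19, 19), client_id))
results = dispatcher.flush()
```

In an actual actor setup, `flush` would be triggered whenever an NN actor becomes available, and the results would be sent back to the waiting search actors instead of being returned as a dictionary.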

Check Norway-notes, too!

---
# Thinking fast and slow

I want to skip RL in favour of the AlphaZero approach with a policy-advised tree search, possibly considering RL for some side-line improvement later. However, I'm still going to initialize the network with imitation learning from my heuristic policy.

Interestingly, a pretty similar approach has been suggested by [Anthony 2017](https://arxiv.org/pdf/1705.08439.pdf), independent of the research done by DeepMind. Would be interesting to compare the approaches.

[This essay](http://www.moderndescartes.com/essays/deep_dive_mcts/) is the most concise and comprehensible piece on UCT. It refers to a *NeuralNet* to provide value estimates for child nodes. I want to start from that, as it also advises an approach to vectorization that massively improves the performance of the search algorithm.

The algorithm in the essay takes a single policy evaluation at the child nodes to estimate a parent's value. It'd be interesting to consider some fast policy to chase down four or five more moves and average their results.

Another thing is the formula used to evaluate the UCB. It adds 1 to the denominators for stability (a precondition for vectorizing the UCB calculation). But it also omits the exploration parameter and doesn't take the logarithm of the parent's number of simulations. [This Medium blog](https://medium.com/@quasimik/monte-carlo-tree-search-applied-to-letterpress-34f41c86e238) has the correct formula and some more helpful explanations.
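
To make the difference concrete, here is a sketch of both scoring rules in vectorized form. `ucb1_scores` is the textbook UCB1 rule with exploration parameter `c` and the logarithm of the parent's visit count. `simplified_scores` is my reading of the variant described above (1 added to the denominators, no `c`, no logarithm) and is not copied from the essay; function and parameter names are my own.

```python
import numpy as np

def ucb1_scores(child_wins, child_visits, parent_visits, c=np.sqrt(2)):
    """Textbook UCB1: w_i / n_i + c * sqrt(ln(N) / n_i).

    Unvisited children (n_i == 0) get +inf so they are tried first.
    """
    child_wins = np.asarray(child_wins, dtype=np.float64)
    child_visits = np.asarray(child_visits, dtype=np.float64)
    # Divisions by zero are expected for unvisited children and masked below
    with np.errstate(divide="ignore", invalid="ignore"):
        exploit = child_wins / child_visits
        explore = c * np.sqrt(np.log(parent_visits) / child_visits)
    return np.where(child_visits > 0, exploit + explore, np.inf)

def simplified_scores(child_wins, child_visits, parent_visits):
    """Variant with 1 added to the denominators, no c and no logarithm."""
    child_wins = np.asarray(child_wins, dtype=np.float64)
    child_visits = np.asarray(child_visits, dtype=np.float64)
    return (child_wins + np.sqrt(parent_visits)) / (1.0 + child_visits)

# Three children of a node with 6 simulations; the third is still unvisited
wins, visits, parent = [3, 1, 0], [4, 2, 0], 6
textbook = ucb1_scores(wins, visits, parent)
simplified = simplified_scores(wins, visits, parent)
```

The `+1` keeps every score finite, which is what allows the whole array to be scored without special-casing unvisited children.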

We'll have the architecture derived in [LinesOfFive.ipynb](LinesOfFive.ipynb) learn by imitating the [HeuristicGomokuPolicy](HeuristicPolicy.py). The latter needs to have some function that maps the logic of method ```suggest``` into a learnable distribution.

That should already create a pretty strong player. Additional steps would possibly include ideas from [Anthony 2017](https://arxiv.org/pdf/1705.08439.pdf) to have system 1 (the policy network) and system 2 (the UCT algorithm) learn from each other.

[HeuristicPolicy.ipynb](HeuristicPolicy.ipynb) is now the starting point for creating initial training data. I still need to find out how to effectively reflect the results of the threat sequence search in the resulting action (move) distribution.

I could start with implementing UCT with the heuristic policy and see how it does.

Another hard thing is then the full documentation and operationalization of the entire quest: providing an interactive interface to play with the algo, and a web version of GO-UI that is also able to run tournaments. Benchmarking my algo against the available players at the official Gomoku tournament site is also to be considered.

Last but not least, the entire thing should be presentable on various occasions: meetups, conferences, whatever.

# The Final AI Actor

The final actor should have an OpenAI Gym environment interface. It features a variable-depth policy-advised threat search. This policy should be trained on threat sequences. It could start as a regular tactical policy that has an extra sense for threats. The other policy should remain strong on tactics. Well, that may not be easy to achieve. Both policies need to be DRL-trained, so the feedback from a trajectory would need to be distributed according to the various roles. May not be easy... ;-(

The tactical tree search will obviously have a tree different from the threat search, so RL would come in two phases. Up to the start of a threat search, the tactical tree and policy would be trained in some manner. Then the threat sequence itself will provide training examples for the threat-search policy. Good thing here: each sub-sequence is just another threat sequence, so learning from threat sequences essentially becomes supervised learning.

---
# Resources

The best resource I have found about policy gradients:

Berkeley's Joshua Achiam's Lecture Slides

[Achiam2017](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf)

---
Somewhat useful: Jonathan Hui's Blog about Natural Policy Gradient and TRPO

[Hui2018](https://medium.com/@jonathan_hui/rl-natural-policy-gradient-actor-critic-using-kronecker-factored-trust-region-acktr-58f3798a4a93)

---

Also pretty readable: Berkeley's Sergey Levine's Lecture notes on Actor-Critic Algorithms:

[Levine](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_5_actor_critic_pdf.pdf)

---

The TRPO Paper:

[Schulman2015](https://arxiv.org/pdf/1502.05477.pdf)

---

Excellent overview of the algos in "Towards Data Science"

[Huang2018-1](https://towardsdatascience.com/introduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287)

---

The most approachable code I've seen so far - and even in TF 2.0
[Ring2019](http://inoryy.com/post/tensorflow2-deep-reinforcement-learning/)

---

Soft Actor-Critic
[Haarnoja2018](https://arxiv.org/pdf/1801.01290.pdf)

---

Tensorflow estimators:

[Cheng2017](https://arxiv.org/pdf/1708.02637.pdf)

---

