# Chapter 9: Deep Learning in the Coin Game



***
*“My CPU is a neural net processor, a learning computer. The more contact I have with humans, the more I learn.”*

-- The Terminator, in Terminator 2: Judgment Day
***



What you'll learn in this chapter:

* The architecture of a neural network
* How deep learning is related to machine learning and artificial intelligence
* Steps and components of the AlphaGo algorithm
* Building and training a fast policy network and a strong policy network in the coin game
* Implementing an MCTS game strategy with policy rollouts

Starting with this chapter, you’ll learn a new AI paradigm: machine learning (ML).
Instead of hard-coding the rules, ML algorithms take in input-output pairs and figure
out the relation between the inputs (which we call features) and the outputs (the
labels). One field of ML, deep learning, has attracted much attention recently. The
algorithm used by AlphaGo is based on deep reinforcement learning, which is a
combination of deep learning and reinforcement learning (a type of ML we’ll cover
later in this book). In this chapter, you’ll learn what deep learning is and how it’s
related to AI and ML.

Deep learning is a type of ML method that’s based on artificial neural networks. A
neural network is a computational model inspired by the structure of neural networks
in the human brain. It’s designed to recognize patterns in data, and it contains layers
of interconnected nodes, or neurons. In this chapter, you’ll learn to use deep neural
networks to design game strategies for the coin game. In particular, you’ll follow the
steps in AlphaGo and create two policy networks. We’ll use these networks later in
the book to create an AlphaGo agent to play the coin game.
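
To make the idea of layers of interconnected neurons concrete, here is a minimal NumPy sketch of a forward pass through a tiny network. It is an illustration under stated assumptions, not the networks built later in the chapter: the 22 inputs mirror the one-hot state encoding used below, while the hidden size of 8, the random weights, and the names `forward`, `relu`, and `softmax` are chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Rectified linear unit: negative values become zero
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes (assumed): 22 inputs, 8 hidden neurons, 2 outputs
W1, b1 = rng.normal(size=(22, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def forward(x):
    # Each layer is a weighted sum followed by a nonlinearity
    hidden = relu(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

# One-hot input for the state with 20 coins left
x = np.zeros((1, 22))
x[0, 20] = 1.0
probs = forward(x)
print(probs.shape)
```

The output is a pair of non-negative numbers that sum to one, which is what lets a policy network be read as a probability distribution over the two possible moves. Training adjusts the weights so that this distribution matches the labels.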

Specifically, the AlphaGo algorithm proceeds in three steps. First, we gather a
large number of games played by Go experts and use deep learning to train two
policy networks to predict the experts’ moves: a fast policy network and a
strong policy network. Second, we use self-play deep reinforcement learning
to further train and improve the strong policy network. At the same time, we train a
value network to predict game outcomes by using the game experience data from the
self-plays. Finally, we design a game strategy based on an improved version of MCTS.
Instead of using the upper confidence bounds for trees (UCT) formula alone to select
the next move, AlphaGo uses a combination of the UCT formula, the improved strong
policy network, and the value network. Further, instead of randomly selecting moves
in game rollouts, AlphaGo uses the fast policy network to roll out games.
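
For reference, the sketch below shows one common textbook form of the UCT score: the average win rate of a move plus an exploration bonus that shrinks as the move is visited more often. This is an assumption for illustration. The exact formula used by the `select()` function in this book's utilities may differ, and the function name `uct_score` and the move statistics are hypothetical.

```python
import math

def uct_score(wins, visits, total_visits, temperature=1.4):
    """Textbook UCT: average win rate plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited moves are tried first
    exploit = wins / visits
    explore = temperature * math.sqrt(math.log(total_visits) / visits)
    return exploit + explore

# Hypothetical statistics for two candidate moves: move -> (wins, visits)
stats = {1: (6, 10), 2: (3, 4)}
total = sum(visits for _, visits in stats.values())

# Select the move with the highest UCT score
best = max(stats, key=lambda m: uct_score(*stats[m], total))
print(best)
```

In this example the less-visited move is selected: it has both a higher win rate and a larger exploration bonus.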

In this chapter, you’ll implement the first step of the AlphaGo algorithm in the coin
game. Specifically, you’ll use the rule-based AI we developed in Chapter 1 to generate
expert moves. You’ll then create two neural networks and train them on the generated
expert moves to predict moves. Finally, you’ll implement policy rollouts in
MCTS, where games are played based on the probability distribution from the fast
policy network, leading to a more intelligent MCTS agent than the traditional
one.

# 1. Deep Learning, ML, and AI

# 2. What Are Neural Networks?

# 3. Two Policy Networks in the Coin Game

# 4. Train Two Networks in the Coin Game

In [1]:
import numpy as np
import random

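def expert(env):
    # Take state%3 coins so the opponent faces a multiple of 3
    if env.state%3 != 0:
        move = env.state%3
    # Otherwise, pick 1 or 2 at random
    else:
        move = random.choice([1,2])
    return move

def non_expert(env):
    # Follow the expert rule with probability 0.5 when it applies;
    # otherwise, pick 1 or 2 at random
    if env.state%3 != 0 and np.random.rand()<0.5:
        move = env.state%3
    else:
        move = random.choice([1,2])
    return move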

In [2]:
from utils.coin_simple_env import coin_game
import time

# Initiate the game environment
env=coin_game()
# Define the one_game() function
def one_game(episode):
    history=[]
    state=env.reset()  
    # The non-expert moves first half the time
    if episode%2==0:
        action=non_expert(env)
        state,reward,done,_=env.step(action)
    while True:   
        action=expert(env)  
        history.append((state,action))
        state,reward,done,_=env.step(action)
        if done:
            break
        action=non_expert(env)
        state,reward,done,_=env.step(action)     
        if done:
            break
    return history

# Simulate one game and print out results
history=one_game(0)
print(history)        

[(20, 2), (17, 2), (14, 2), (11, 2), (8, 2), (4, 1), (2, 2)]


In [3]:
# simulate the game 10000 times 
results = []        
for episode in range(10000):
    history=one_game(episode)
    results+=history   

In [4]:
import pickle
# save the simulation data on your computer
with open('files/games_coin.p', 'wb') as fp:
    pickle.dump(results,fp)
# read the data and print out the first 10 observations       
with open('files/games_coin.p', 'rb') as fp:
    games = pickle.load(fp)
print(games[:10])

[(19, 1), (16, 1), (13, 1), (10, 1), (7, 1), (4, 1), (1, 1), (21, 2), (18, 2), (14, 2)]
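
Before training on these (state, action) pairs, it can help to check that the labels follow the expert rule. The sketch below is a small sanity check, not part of the original notebook. It uses only the ten observations printed above as sample data, and the helper name `moves_by_state` is hypothetical.

```python
from collections import Counter, defaultdict

# The ten observations printed above, used here as a tiny sample
sample = [(19, 1), (16, 1), (13, 1), (10, 1), (7, 1),
          (4, 1), (1, 1), (21, 2), (18, 2), (14, 2)]

def moves_by_state(pairs):
    # Count how often each action was chosen in each state
    table = defaultdict(Counter)
    for state, action in pairs:
        table[state][action] += 1
    return table

table = moves_by_state(sample)
for state in sorted(table, reverse=True):
    print(state, dict(table[state]))

# In states not divisible by 3, the expert rule always plays state % 3
consistent = all(action == state % 3
                 for state, action in sample if state % 3 != 0)
print(consistent)
```

Applied to the full `games` list, the same table would show that labels are deterministic in states not divisible by 3 and split between the two moves otherwise, so a trained network cannot be expected to predict the moves perfectly in those remaining states.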


## 4.2. Create Two Neural Networks


In [5]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

fast_model = Sequential()
fast_model.add(Dense(units=32,activation="relu",
                 input_shape=(22,)))
fast_model.add(Dense(2, activation='softmax'))
fast_model.compile(loss='categorical_crossentropy',
                   optimizer='adam', 
                   metrics=['accuracy'])



In [6]:
strong_model = Sequential()
strong_model.add(Dense(units=64,activation="relu",
                 input_shape=(22,)))
strong_model.add(Dense(32, activation="relu"))
strong_model.add(Dense(16, activation="relu"))
strong_model.add(Dense(2, activation='softmax'))
strong_model.compile(loss='categorical_crossentropy',
                   optimizer='adam', 
                   metrics=['accuracy'])

## 4.3. Train the Neural Networks


In [7]:
states=[20,1]
one_hot=to_categorical(states,22)
print(one_hot)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


In [8]:
actions=[1,2]
# change actions 1 and 2 to 0 and 1.
actions=np.array(actions)-1
# change actions to one-hot actions
one_hot_actions=to_categorical(actions,2)
print(one_hot_actions)

[[1. 0.]
 [0. 1.]]
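
The two cells above rely on `to_categorical()`. As a rough illustration of what one-hot encoding does, the following sketch reproduces the same arrays with plain NumPy by indexing rows of an identity matrix. The helper `one_hot` is a hypothetical name and is not the library's implementation.

```python
import numpy as np

def one_hot(indices, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[np.asarray(indices)]

# States 20 and 1 encoded over 22 possible states (0 through 21)
states = one_hot([20, 1], 22)
# Actions 1 and 2 shifted to 0 and 1, then encoded over 2 classes
actions = one_hot(np.array([1, 2]) - 1, 2)

print(states.shape)
print(actions)
```

Shifting the actions from 1 and 2 to 0 and 1 matters because class indices start at zero: without the shift, action 2 would fall outside a two-class encoding.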


In [9]:
with open('files/games_coin.p','rb') as fp:
    games=pickle.load(fp)

states = []
actions = []
for x in games:
    state=to_categorical(x[0],22)
    action=to_categorical(x[1]-1,2)
    states.append(state)
    actions.append(action)

X = np.array(states).reshape((-1, 22))
y = np.array(actions).reshape((-1, 2))

In [10]:
# Train the fast policy network for 25 epochs
fast_model.fit(X, y, epochs=25, verbose=1)
fast_model.save('files/fast_coin.h5')

Epoch 1/25
...
Epoch 25/25


In [11]:
strong_model.fit(X, y, epochs=25, verbose=1)
strong_model.save('files/strong_coin.h5')

Epoch 1/25
...
Epoch 25/25
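
Once a network is trained, a quick way to inspect it is to look at the move probabilities it assigns in a few states. The sketch below is one way to do that, under stated assumptions: `move_probabilities` is a hypothetical helper, and `stand_in_model` is a hand-written substitute for a trained network so that the example runs without TensorFlow. Its probabilities are made up for illustration and are not results from the models above.

```python
import numpy as np

def onehot_encoder(state, num_states=22):
    onehot = np.zeros((1, num_states))
    onehot[0, state] = 1
    return onehot

def move_probabilities(model, state):
    """Return {move: probability} for moves 1 and 2 in the given state."""
    probs = np.squeeze(np.asarray(model(onehot_encoder(state))))
    return {1: float(probs[0]), 2: float(probs[1])}

# Hypothetical stand-in for a trained network (assumption): it favors
# the move state % 3 with made-up confidence 0.9.
def stand_in_model(onehot_state):
    state = int(np.argmax(onehot_state))
    if state % 3 == 1:
        return np.array([[0.9, 0.1]])
    if state % 3 == 2:
        return np.array([[0.1, 0.9]])
    return np.array([[0.5, 0.5]])

for state in (20, 19, 18):
    print(state, move_probabilities(stand_in_model, state))
```

To inspect the real networks, pass `fast_model` or `strong_model` in place of `stand_in_model`, since the helper only assumes that the model can be called on a one-hot state.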


# 5. MCTS with Policy Rollouts in the Coin Game


## 5.1. Policy-Based MCTS in the Coin Game
  

Go to the book's GitHub repository to download the file *ch09util.py* and place it in the folder /Desktop/ags/utils/ on your computer. In the file, we define an *onehot_encoder()* function and a *DL_stochastic()* function as follows:

In [12]:
def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot

def DL_stochastic(env, model): 
    state = env.state
    onehot_state = onehot_encoder(state)
    action_probs = model(onehot_state)
    return np.random.choice([1,2], 
            p=np.squeeze(action_probs))

In [13]:
def policy_simulate(env_copy,done,reward,model):
    # If the game has already ended, return the final reward
    if done:
        return reward
    # Otherwise, let the policy network pick moves until the game ends
    while True:
        move=DL_stochastic(env_copy,model)
        state,reward,done,info=env_copy.step(move)
        if done:
            return reward
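
The difference between a random rollout and a policy rollout can be seen in isolation with a small self-contained sketch. Everything here is an assumption made for illustration: `TinyCoinEnv` is a hypothetical stand-in for `coin_game` (whose internals are not shown in this chapter), the reward convention of +1 for player 1 and -1 for player 2 is assumed, and `policy_move` uses hand-written probabilities in place of a trained fast policy network.

```python
import numpy as np

class TinyCoinEnv:
    """Hypothetical stand-in for coin_game: players alternate taking
    1 or 2 coins, and whoever takes the last coin wins. Reward is +1
    if player 1 wins and -1 if player 2 wins (assumed convention)."""
    def __init__(self, state=21, player=1):
        self.state, self.player = state, player

    def step(self, move):
        move = min(move, self.state)       # cannot take more than remain
        self.state -= move
        done = self.state == 0
        reward = 0
        if done:
            reward = 1 if self.player == 1 else -1
        else:
            self.player = 3 - self.player  # switch between players 1 and 2
        return self.state, reward, done, {}

def random_move(env):
    # Traditional rollout: both moves are equally likely
    return int(np.random.choice([1, 2]))

def policy_move(env):
    # Policy rollout: sample from a distribution that favors state % 3
    if env.state % 3 == 1:
        p = [0.9, 0.1]
    elif env.state % 3 == 2:
        p = [0.1, 0.9]
    else:
        p = [0.5, 0.5]
    return int(np.random.choice([1, 2], p=p))

def rollout(env, choose_move):
    # Play the game to the end and return the final reward
    while True:
        _, reward, done, _ = env.step(choose_move(env))
        if done:
            return reward

np.random.seed(0)
print(rollout(TinyCoinEnv(state=20), random_move))
print(rollout(TinyCoinEnv(state=20), policy_move))
```

Both rollouts have the same structure. The only change is the distribution from which moves are sampled, which is the role the fast policy network plays in `policy_simulate()`.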

In [15]:
from utils.ch08util import select, expand, backpropagate, next_move

def policy_mcts_coin(env,model,num_rollouts=100,temperature=1.4):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # create three dictionaries counts, wins, losses
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    # roll out games
    for _ in range(num_rollouts):
        # selection
        move=select(env,counts,wins,losses,temperature)
        # expansion
        env_copy, done, reward=expand(env,move)
        # simulation
        reward=policy_simulate(env_copy,done,reward,model)
        # backpropagate
        counts,wins,losses=backpropagate(
            env,move,reward,counts,wins,losses)
    # make the move
    return next_move(counts,wins,losses)

## 5.2. The Effectiveness of the Policy MCTS Agent


In [41]:
from utils.ch08util import mcts
#from utils.ch09util import policy_mcts_coin

model = fast_model
env=coin_game()
results=[]
for i in range(100):
    state=env.reset() 
    # Half the time, the UCT MCTS agent moves first
    if i%2==0:
        action=mcts(env,num_rollouts=100)
        state, reward, done, info=env.step(action)
    while True:
        action=policy_mcts_coin(env,model,num_rollouts=100) 
        state, reward, done, info=env.step(action)
        if done:
            # result is 1 if the policy MCTS agent wins
            results.append(1)    
            break  
        action=mcts(env,num_rollouts=100)
        state, reward, done, info=env.step(action)
        if done:
            # result is -1 if the policy MCTS agent loses
            results.append(-1)   
            break  

In [43]:
wins=results.count(1)
print(f"the policy MCTS agent has won {wins} games")
losses=results.count(-1)
print(f"the policy MCTS agent has lost {losses} games")   

the policy MCTS agent has won 99 games
the policy MCTS agent has lost 1 games
