This notebook walks you through the implementation of the wrapper code to train Pokemon teams for battle using keras-rl2 agents. 

**What is wrapper code?**
Wrapper code coordinates between the keras-rl2 library, which implements several key reinforcement learning algorithms, and the Pokemon Showdown environment. It lets us train keras-rl2 agents that play in the Pokemon Showdown environment.

The first thing we do is import nest_asyncio and apply it. You need not worry too much about how this library works. The key point is that Jupyter notebooks already run an event loop, and the code we are running also uses an event loop in the backend. Plain asyncio does not allow an event loop to be re-entered while it is running, so we need this library to allow nested event loops.

In [1]:
import nest_asyncio 
nest_asyncio.apply()
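
To see the restriction that nest_asyncio works around, the sketch below uses only the standard library. It tries to re-enter an event loop that is already running, which is roughly the situation of async code launched from a notebook cell. The coroutine names `inner` and `outer` are illustrative and are not part of this notebook.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are now inside a running event loop, much like code in a Jupyter cell.
    loop = asyncio.get_running_loop()
    coro = inner()
    try:
        # Re-entering the running loop is what plain asyncio refuses to do.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without nest_asyncio the call is rejected with a RuntimeError. Applying nest_asyncio patches asyncio so that this kind of nested call is permitted.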

Next, we store the type of agent we would like to use in this notebook in the variable 'model_type', so that we can check this variable later in the notebook before declaring the relevant model (DDQN vs SARSA vs CEM). You must restart the kernel every time you want to begin training again.

In [2]:
model_type = "dqn"
#options = cem, sarsa, dqn

For clarity, let's walk through each of the imports in the cell below.

# General Libraries
*numpy*: this is a common library used to manipulate arrays & perform array-wise operations etc.
*tensorflow*: this is the library we use to train deep neural networks. We will perform deep reinforcement learning (not just reinforcement learning), so we require this library.
*pandas*: this is a common library used to manipulate tables & dataframes.

# poke-env imports
These imports help us to communicate with the Pokemon Showdown environment.
 
*PlayerConfiguration*: a class that allows us to store simple attributes such as player username and password.
*LocalhostServerConfiguration*: specifies that we connect to a Pokemon Showdown server running on localhost. We could also use the main (remote) Pokemon Showdown server, but this would increase the load on a server that many people are using, so it is highly discouraged.

There exists a base class called Player, from which Gen7EnvSinglePlayer, RandomPlayer and FrozenRLPlayer each inherit.

*Gen7EnvSinglePlayer*: this is the class that will be the parent of our reinforcement learning agent class (later declared as SimpleRLPlayer) that we will be using for training. It enables us to start battles on the Pokemon Showdown environment, update our neural network etc.

*RandomPlayer*: this is the class that will define players that take completely random moves at each step. It also functions as the parent of the MaxDamagePlayer class. The MaxDamagePlayer will take the move which causes **maximum** damage based on the base power of the move. This is not necessarily the best move in the long-term. Both RandomPlayer and MaxDamagePlayer are used as opponents to our SimpleRLPlayer during training.

*FrozenRLPlayer*: Similar to RandomPlayer and MaxDamagePlayer, this player is used as an opponent to the SimpleRLPlayer during training. The FrozenRLPlayer is initialized using a pre-trained RL agent from a previous iteration of training the SimpleRLPlayer. Using this FrozenRLPlayer we can include some self-play in our implementation.

# keras-rl imports
These imports help us to communicate with keras-rl2 agents.

*CEMAgent*, *DQNAgent*, *SARSAAgent*: These are the imports for each of the RL agents that we use.

*rl.policy imports*: These help us to import different policies for our agents to follow. For example, an epsilon greedy policy means that our agent usually chooses the move with the highest estimated value, but chooses a random move with probability epsilon, to aid exploration (vs the pure exploitation that a greedy policy would undertake). Linear annealing lets us decay epsilon over the course of training (leaning towards exploitation later in training).
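
The two ideas in this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not keras-rl2's own code: the function names are hypothetical, and the default values simply mirror the settings used later in this notebook (`value_max=1.0`, `value_min=0.05`, `nb_steps=10000`).

```python
import random


def linear_annealed_eps(step, value_max=1.0, value_min=0.05, nb_steps=10000):
    """Decay epsilon linearly from value_max, never going below value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, value_max + slope * step)


def eps_greedy_action(q_values, eps, rng=random):
    """Pick a random action with probability eps, else the highest-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.1, 0.7, 0.2]
print(linear_annealed_eps(0))      # epsilon at the start of training
print(linear_annealed_eps(5000))   # epsilon halfway through the decay
print(linear_annealed_eps(20000))  # epsilon after the decay has finished
print(eps_greedy_action(q_values, eps=0.0))  # purely greedy choice
```

With `eps=0.0` the choice is always the action with the largest value, which corresponds to the `value_test=0` setting used at evaluation time.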

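*rl.memory imports*: These help us to import different types of memory. In the code below, a SequentialMemory is created for the DQN/SARSA branch and passed to the DQN agent (the SARSA agent is constructed without a memory), whilst CEM uses an EpisodeParameterMemory. A sequential memory stores individual transitions in the order they occur so that they can be sampled for training later, whereas the episode parameter memory records the network parameters tried in each episode together with the reward they earned.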
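
The idea behind a bounded, transition-based memory can be sketched as follows. This is a simplified stand-in written for illustration, not keras-rl2's `SequentialMemory`: the class name `TinyReplayMemory` and the tuple layout are assumptions.

```python
import random
from collections import deque


class TinyReplayMemory:
    """Bounded buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, limit):
        # Once the limit is reached, the oldest transitions are dropped first.
        self.buffer = deque(maxlen=limit)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Draw a random batch of stored transitions without replacement.
        return rng.sample(list(self.buffer), batch_size)


memory = TinyReplayMemory(limit=3)
for t in range(5):
    memory.append((t, 0, 0.0, t + 1, False))

print(len(memory.buffer))    # number of transitions kept
print(memory.buffer[0][0])   # state of the oldest surviving transition
```

The `limit` argument plays the same role as `limit=10000` in the memory objects created later in this notebook.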


# tensorflow imports
The *tensorflow.keras* imports are self-explanatory for anyone who has used tensorflow. We are importing the different types of layers, models and optimizers. Here is a nice intro to keras, for those unfamiliar with it: https://towardsdatascience.com/introduction-to-deep-learning-with-keras-17c09e4f0eb2

In [3]:
import numpy as np
import tensorflow as tf
import pandas as pd

from poke_env.player_configuration import PlayerConfiguration
from poke_env.player.env_player import Gen7EnvSinglePlayer
from poke_env.player.random_player import RandomPlayer
from poke_env.player.frozen_rl_player import FrozenRLPlayer
from poke_env.server_configuration import LocalhostServerConfiguration

from rl.agents.cem import CEMAgent
from rl.agents.dqn import DQNAgent
from rl.agents.sarsa import SARSAAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory, EpisodeParameterMemory

from tensorflow.keras.layers import Dense, Flatten, Activation
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


# Defining Our Players

We pre-imported our Gen7EnvSinglePlayer, RandomPlayer and FrozenRLPlayer (I would strongly encourage you to look into the files that define these classes). However, below, we define the SimpleRLPlayer, which is the agent that we will train, and the MaxDamagePlayer, which is one of our three types of opponents, as explained earlier.

*SimpleRLPlayer*
We can see that the player has two methods defined here i.e. *embed_battle* and *compute_reward* respectively. *embed_battle* is used to embed the current state of the battle between two pokemon teams. *compute_reward* is used to compute the current reward. We require both of these things in order to select the next action, as well as to update our deep RL model weights. You will find it interesting to note that the FrozenRLPlayer has very similar methods to these two, and can be thought of as a simpler version of SimpleRLPlayer that has fixed weights, and cannot be updated or start battles. FrozenRLPlayer inherits straight from Player, unlike SimpleRLPlayer which is more powerful and inherits from Gen7EnvSinglePlayer.
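
To make the shape of the state vector concrete, here is a standalone sketch of the same embedding logic that does not need a running battle. The function `embed` and its inputs are hypothetical: each move is given as a `(base_power, damage_multiplier)` pair, and the last two arguments are the number of fainted pokemon on each side, matching the `mon.fainted` filter in the class below.

```python
def embed(moves, fainted_team, fainted_opponent):
    """Build the 10-component state vector from plain numbers."""
    base_power = [-1.0] * 4  # -1 marks a missing or unavailable move
    multiplier = [1.0] * 4   # neutral effectiveness by default
    for i, (power, mult) in enumerate(moves[:4]):
        base_power[i] = power / 100  # same rescaling as in embed_battle
        multiplier[i] = mult
    return base_power + multiplier + [fainted_team / 6, fainted_opponent / 6]


# Two available moves, one fainted pokemon on our side, three on the other.
state = embed([(90, 2.0), (40, 0.5)], fainted_team=1, fainted_opponent=3)
print(len(state))
print(state)
```

The vector always has 4 + 4 + 2 = 10 entries, which is why the networks defined later use `input_shape=(1, 10)`.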

*MaxDamagePlayer*
As we can see, the next move is decided purely on the basis of the base power of each of the moves available to the pokemon.

In [4]:
# We define our RL player
# It needs a state embedder and a reward computer, hence these two methods
class SimpleRLPlayer(Gen7EnvSinglePlayer):
    def embed_battle(self, battle):
        # -1 indicates that the move does not have a base power
        # or is not available
        moves_base_power = -np.ones(4)
        moves_dmg_multiplier = np.ones(4)
        for i, move in enumerate(battle.available_moves):
            moves_base_power[i] = (
                move.base_power / 100
            )  # Simple rescaling to facilitate learning
            if move.type:
                moves_dmg_multiplier[i] = move.type.damage_multiplier(
                    battle.opponent_active_pokemon.type_1,
                    battle.opponent_active_pokemon.type_2,
                )

        # We count how many pokemons have fainted in each team
        # (the filter below keeps mons where mon.fainted is True)
        remaining_mon_team = (
            len([mon for mon in battle.team.values() if mon.fainted]) / 6
        )
        remaining_mon_opponent = (
            len([mon for mon in battle.opponent_team.values() if mon.fainted]) / 6
        )

        # Final vector with 10 components
        return np.concatenate(
            [
                moves_base_power,
                moves_dmg_multiplier,
                [remaining_mon_team, remaining_mon_opponent],
            ]
        )

    def compute_reward(self, battle) -> float:
        return self.reward_computing_helper(
            battle, fainted_value=2, hp_value=1, victory_value=30
        )


class MaxDamagePlayer(RandomPlayer):
    def choose_move(self, battle):
        # If the player can attack, it will
        if battle.available_moves:
            # Finds the best move among available ones
            best_move = max(battle.available_moves, key=lambda move: move.base_power)
            return self.create_order(best_move)

        # If no attack is available, a random switch will be made
        else:
            return self.choose_random_move(battle)
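
The selection rule in `MaxDamagePlayer.choose_move` can be tried in isolation. In this sketch, `Move` is a hypothetical stand-in for a poke-env move object that only carries a name and a base power, and the move list is made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a move object; only base_power matters here.
Move = namedtuple("Move", ["name", "base_power"])

available_moves = [Move("tackle", 40), Move("earthquake", 100), Move("toxic", 0)]

# Same rule as in choose_move: keep the move with the largest base power.
best_move = max(available_moves, key=lambda move: move.base_power)
print(best_move.name)
```

Because only base power is compared, a status move with a base power of 0 is never picked while an attacking move is available, which is one reason this strategy is not necessarily the best in the long term.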


# Time to train!

Finally, we are ready to think about training our agents. We quickly define two functions, one for training and one for evaluation. 

In [5]:
NB_TRAINING_STEPS = 10000
NB_EVALUATION_EPISODES = 100

# variable for naming .csv files.
# Change this according to whether the training process was carried out against a random player or a max damage player
TRAINING_OPPONENT = 'RandomPlayer'

tf.random.set_seed(0)
np.random.seed(0)


# This is the function that will be used to train the agent
def agent_training(player, agent, nb_steps, filename):
    model = agent.fit(player, nb_steps=nb_steps)
    # save model history to csv
    save_file = f"{filename}_trainlog_{nb_steps}eps.csv"
    print("===============================================")
    print(f"Saving model history as {save_file}")
    print("===============================================")
    pd.DataFrame(model.history).to_csv(save_file)
    player.complete_current_battle()


def agent_evaluation(player, agent, nb_episodes, filename):
    # Reset battle statistics
    player.reset_battles()
    model = agent.test(player, nb_episodes=nb_episodes, visualize=False, verbose=False)

    # save model history to csv
    save_file = f"{filename}_testlog_{nb_episodes}eps.csv"
    print("===============================================")
    print(f"Saving model history as {save_file}")
    print("===============================================")
    pd.DataFrame(model.history).to_csv(save_file)
    
    # Note: the "CEM" label below is hard-coded, so it is printed for every model_type
    print(
        "CEM Evaluation: %d victories out of %d episodes"
        % (player.n_won_battles, nb_episodes)
    )

We are performing **deep** reinforcement learning, so we first need to define a keras network for this training. We define the network slightly differently, depending on whether we are using CEM, DQN or SARSA. In the cell below, we first instantiate the four sorts of players that we use. Then, we define the neural network structures:

For DQN & SARSA, we use identical structures and a linear activation, whilst we use a softmax activation for CEM. This makes sense because the CEM network outputs the policy directly (a probability for each action), whilst DQN and SARSA predict the q-values, which are then converted into a policy.
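
The difference between the two output activations can be illustrated with plain Python. This is a sketch with made-up numbers: `raw_outputs` stands for the values of the final Dense layer, and the `softmax` helper is written out by hand rather than taken from keras.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


raw_outputs = [2.0, 1.0, 0.1]

# Softmax output layer: the network output is itself a policy over actions.
policy_probs = softmax(raw_outputs)

# Linear output layer: the outputs are read as q-values, and a separate rule
# (here a greedy argmax) turns them into an action.
greedy_action = max(range(len(raw_outputs)), key=lambda i: raw_outputs[i])

print(policy_probs)
print(greedy_action)
```

The softmax values are all positive and sum to 1, whereas linear outputs are unconstrained, which suits q-values since they can be negative or larger than 1.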


In [6]:
if __name__ == "__main__":
    env_player = SimpleRLPlayer(
        player_configuration=PlayerConfiguration("satunicarina", None),
        battle_format="gen7randombattle",
        server_configuration=LocalhostServerConfiguration,
    )

    opponent = RandomPlayer(
        player_configuration=PlayerConfiguration("duanicarina", None),
        battle_format="gen7randombattle",
        server_configuration=LocalhostServerConfiguration,
    )

    second_opponent = MaxDamagePlayer(
        player_configuration=PlayerConfiguration("tiganicarina", None),
        battle_format="gen7randombattle",
        server_configuration=LocalhostServerConfiguration,
    )
    
    third_opponent = FrozenRLPlayer(
        player_configuration=PlayerConfiguration("empatnicarina", None),
        battle_format="gen7randombattle",
        server_configuration=LocalhostServerConfiguration,
    )
    # Output dimension: one network output per available action
    n_action = len(env_player.action_space)
    
    if model_type=='cem':
        # Episode parameter memory used by CEM
        memory = EpisodeParameterMemory(limit=10000, window_length=1)
        # deep network
        model = Sequential()
        model.add(Flatten(input_shape=(1, 10)))
        model.add(Dense(16))
        model.add(Activation('relu'))
        model.add(Dense(16))
        model.add(Activation('relu'))
        model.add(Dense(16))
        model.add(Activation('relu'))
        model.add(Dense(n_action))
        model.add(Activation('softmax'))

        # Simple epsilon greedy (defined here, but not passed to CEMAgent below)
        policy = LinearAnnealedPolicy(
            EpsGreedyQPolicy(),
            attr="eps",
            value_max=1.0,
            value_min=0.05,
            value_test=0,
            nb_steps=10000,
        )
#         #only uncomment below line if you want to continue training an old model
#         model = tf.keras.models.load_model('/Users/nicarinanan/Desktop/poke-env/modelpostmax2preserve_20000')


        # Defining our agent
        agent = CEMAgent(model=model, nb_actions=n_action, memory=memory,
                       batch_size=50, nb_steps_warmup=1000, train_interval=50, elite_frac=0.05, noise_ampl=4)


        agent.compile()
    
    elif model_type=='dqn' or model_type=='sarsa':

        model = Sequential()
        model.add(Dense(128, activation="elu", input_shape=(1, 10)))

        # Our embeddings have shape (1, 10), which affects our hidden layer
        # dimension and output dimension
        # Flattening resolves potential issues that would arise otherwise
        model.add(Flatten())
        model.add(Dense(64, activation="elu"))
        model.add(Dense(n_action, activation="linear"))

        memory = SequentialMemory(limit=10000, window_length=1)

        # Simple epsilon greedy
        policy = LinearAnnealedPolicy(
            EpsGreedyQPolicy(),
            attr="eps",
            value_max=1.0,
            value_min=0.05,
            value_test=0,
            nb_steps=10000,
        )
#         #only uncomment below line if you want to continue training an old model
#         model = tf.keras.models.load_model('/Users/nicarinanan/Desktop/poke-env/modelpostmax2preserve_20000')

        # Defining our DQN
        if model_type=='dqn':
            agent = DQNAgent(
                model=model,
                nb_actions=n_action,
                policy=policy,
                memory=memory,
                nb_steps_warmup=1000,
                gamma=0.5,
                target_model_update=1,
                delta_clip=0.01,
                enable_double_dqn=True,
            )
        elif model_type=='sarsa':
            agent = SARSAAgent(model=model, nb_actions=n_action, nb_steps_warmup=1000, policy=policy)

        agent.compile(Adam(lr=0.00025), metrics=["mae"])
        


    # Training
    env_player.play_against(
        env_algorithm=agent_training,
        opponent=opponent,
        env_algorithm_kwargs={"agent": agent, "nb_steps": NB_TRAINING_STEPS, "filename": TRAINING_OPPONENT+'Notebook'},
    )
    model.save("model_notebook_%d" % NB_TRAINING_STEPS)

    # Evaluation
    print("Results against random player:")
    env_player.play_against(
        env_algorithm=agent_evaluation,
        opponent=opponent,
        env_algorithm_kwargs={"agent": agent, "nb_episodes": NB_EVALUATION_EPISODES, "filename": f'({TRAINING_OPPONENT}_{NB_TRAINING_STEPS})RandomPlayerNotebook'},
    )

    print("\nResults against max player:")
    env_player.play_against(
        env_algorithm=agent_evaluation,
        opponent=second_opponent,
        env_algorithm_kwargs={"agent": agent, "nb_episodes": NB_EVALUATION_EPISODES, "filename": f'({TRAINING_OPPONENT}_{NB_TRAINING_STEPS})MaxPlayerNotebook'},
    )

    print("\nResults against frozen rl player:")
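    # Note: this call reuses the "MaxPlayerNotebook" filename from the previous
    # evaluation, so the max player test log is overwritten
    env_player.play_against(
        env_algorithm=agent_evaluation,
        opponent=third_opponent,
        env_algorithm_kwargs={"agent": agent, "nb_episodes": NB_EVALUATION_EPISODES, "filename": f'({TRAINING_OPPONENT}_{NB_TRAINING_STEPS})MaxPlayerNotebook'},
    )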


Training for 10000 steps ...
Interval 1 (0 steps performed)
done, took 119.238 seconds
Saving model history as RandomPlayerNotebook_trainlog_10000eps.csv
INFO:tensorflow:Assets written to: model_notebook_10000/assets
Results against random player:
Saving model history as (RandomPlayer_10000)RandomPlayerNotebook_testlog_100eps.csv
CEM Evaluation: 95 victories out of 100 episodes

Results against max player:
Saving model history as (RandomPlayer_10000)MaxPlayerNotebook_testlog_100eps.csv
CEM Evaluation: 67 victories out of 100 episodes

Results against frozen rl player:
Saving model history as (RandomPlayer_10000)MaxPlayerNotebook_testlog_100eps.csv
CEM Evaluation: 46 victories out of 100 episodes
