This is an implementation of a swergio agent based on the A2C reinforcement learning model, combined with a VAE-GAN approach and a MemN2N network.
The model consists of four main components. As in the VAE-GAN approach, three of them are the Encoder, the Generator, and the Discriminator. Additionally, a Critic is implemented for reinforcement learning with the A2C approach.
The encoder's objective is to transform the input information (current and historic Questions & Answers) into a latent representation. To reflect the information of prior messages, the encoder uses a MemN2N model with an attention mechanism.
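The core of a MemN2N memory hop is an attention read over the stored message embeddings. The following is a minimal single-hop sketch loosely following the MemN2N paper; all variable names and sizes are illustrative assumptions, not the repository's actual code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(query, memory_in, memory_out):
    """query: (d,), memory_in/memory_out: (n_messages, d).

    Matches the query against the input memory, then reads a
    weighted sum from the output memory (one attention hop).
    """
    attention = softmax(memory_in @ query)   # weight over prior messages
    read = attention @ memory_out            # attended memory content
    return query + read                      # updated latent state

rng = np.random.default_rng(0)
d, n = 4, 6                                  # embedding size, history length
query = rng.normal(size=d)
state = memory_hop(query,
                   rng.normal(size=(n, d)),
                   rng.normal(size=(n, d)))
print(state.shape)  # (4,)
```

A full MemN2N stacks several such hops, feeding each hop's output state in as the next hop's query.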
The generator produces the five pieces of information required for a future action/message, based on the provided latent vector representation of the current state:
- Act Flag: Decide if the Agent should act based on the input.
- Chat Type: Decide what type of chat (e.g. AGENT, KNOWLEDGEBASE...) the action/message should be sent to
- Chat Number: Decide to which chat of the selected types the action/message should be sent
- Message Type: Decide what type of message (e.g. QUESTION, ANSWER) the agent wants to send
- Message Text: Generate the action/message text. The message text is generated with a recurrent neural net with GRU cells.
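The four categorical outputs can be thought of as separate readout heads on the shared latent vector, with the message text decoded separately by the GRU. A hedged sketch of those heads with plain numpy, where all head names, class counts, and the linear readout are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_size = 30                      # matches the --latentsize default
n_chat_types, n_chats, n_msg_types = 3, 4, 2   # assumed class counts

z = rng.normal(size=latent_size)      # latent state from the encoder
heads = {                             # one linear readout per decision
    "act_flag":     rng.normal(size=(latent_size, 2)),
    "chat_type":    rng.normal(size=(latent_size, n_chat_types)),
    "chat_number":  rng.normal(size=(latent_size, n_chats)),
    "message_type": rng.normal(size=(latent_size, n_msg_types)),
}
action = {name: int(np.argmax(z @ w)) for name, w in heads.items()}
print(sorted(action))  # ['act_flag', 'chat_number', 'chat_type', 'message_type']
```

In the actual model each head would be trained jointly with the GRU text decoder, so that all five outputs are conditioned on the same latent state.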
The discriminator is trained to distinguish between "real" actions/messages (expert data) and "fake" (generated) ones. The input to the discriminator is the latent vector of the current state (history + last messages) together with the suggested action/message. The discriminator is optimized with the loss function of a least squares GAN. To avoid a non-differentiable network, the generator receives the discriminator's feedback as a reward and optimizes its policy accordingly.
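The least-squares GAN objectives (https://arxiv.org/abs/1611.04076) push the discriminator's real scores toward 1 and fake scores toward 0, while the generator is rewarded for fake scores near 1. A minimal numpy sketch of the two objectives, with the reward sign convention as an assumption:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # Discriminator: real scores should be near 1, fake scores near 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_reward(d_fake):
    # Generator reward (maximal at 0 when the discriminator is fooled);
    # used as a policy-gradient reward instead of backpropagating
    # through the non-differentiable sampled message.
    return -0.5 * np.mean((d_fake - 1.0) ** 2)

d_real = np.array([0.9, 0.8])   # discriminator scores on expert messages
d_fake = np.array([0.1, 0.2])   # discriminator scores on generated messages
print(round(d_loss(d_real, d_fake), 3))  # 0.025
```

Compared with the standard GAN's log loss, the squared error keeps gradients informative even for samples the discriminator classifies confidently.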
The critic approximates the value function of the current state. It receives the latent vector from the encoder. The critic is only used while running the model in online mode.
The agent can be trained offline (using "expert data" as ground truth) or online (using firsthand feedback of other swergio components).
In offline mode, the model is trained based on given expert data. This includes the encoder, the generator as well as the discriminator. To fit the discriminator to "fake" data, the generator generates messages based on a random latent vector input. See the chart for an overview:
To start the offline training we can run the "runOffline.py" file.
In online mode, the model is trained based on the received feedback of connected swergio components. The online runner is separated into three main modules.
The actor receives the last observation and uses the network to generate the next action. To further explore the observation space, the actor occasionally chooses a random action instead of the network's (not necessarily optimal) output.
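This exploration rule can be sketched as a simple epsilon-greedy choice; the probability below mirrors the --explorprobability default of 0.05, but the function and its signature are illustrative assumptions:

```python
import random

def choose_action(greedy_action, n_actions, explore_prob, rng):
    # With probability explore_prob pick a random action to explore the
    # observation space; otherwise follow the network's greedy choice.
    if rng.random() < explore_prob:
        return rng.randrange(n_actions)
    return greedy_action

rng = random.Random(0)
actions = [choose_action(2, 5, 0.05, rng) for _ in range(1000)]
# Roughly 95% of the chosen actions follow the greedy choice.
```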
Up to N of the last observations and final actions of the actor are stored in the experience class.
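A bounded buffer of this kind is naturally expressed with a deque that discards the oldest entries once N is exceeded. A hypothetical sketch of the experience store, assuming N = 5 here; the class name and fields are illustrative, not the repository's actual code:

```python
from collections import deque

class Experience:
    """Keeps only the last n (observation, action) pairs."""
    def __init__(self, n):
        self.buffer = deque(maxlen=n)   # oldest entries drop off automatically

    def add(self, observation, action):
        self.buffer.append((observation, action))

exp = Experience(5)
for step in range(8):
    exp.add({"obs": step}, {"act": step})
print(len(exp.buffer))  # 5 -- only the last 5 steps remain
```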
After N steps or at the end of an episode (the next question from a client), the trainer trains the model based on the saved experiences. In addition to the experience, expert data is used to continuously train the generator and discriminator on "real" data.
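Given the --gamma and --gaelambda arguments listed below, the trainer presumably turns the N stored rewards and the critic's value estimates into advantages via generalized advantage estimation. A minimal sketch under that assumption, with illustrative variable names:

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.96):
    """Generalized advantage estimation over one stored trajectory.

    rewards/values: lists of length N; last_value: critic's estimate
    for the state after the final step (0.0 at episode end).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        running = delta + gamma * lam * running               # discounted sum
        advantages[t] = running
        next_value = values[t]
    return advantages

adv = gae([1.0, 0.0, 1.0], [0.5, 0.5, 0.5], last_value=0.0)
print(len(adv))  # 3
```

The advantages then weight the policy-gradient update for the generator, while the critic is regressed toward the corresponding returns.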
See the chart for an overview:
To start the online mode run the "run.py" file.
Argument: --logdir
Environment variable: LOG_PATH
Argument: --savedir
Environment variable: MODELSAVE_PATH
Argument: --expertdata
Environment variable: EXPERTDATA_PATH
Argument: --agents (e.g. agent_A, agent_B)
Environment variable: AGENT
Argument: --worker (e.g. worker_A, worker_B)
Environment variable: WORKER
Argument: --knowledge (e.g. kb_A, kb_B)
Environment variable: KNOWLEDGEBASE
Argument: --batchsize = 20
Argument: --memorysize = 10
Argument: --latentsize = 30
Argument: -li, --logint = 500
Argument: -si, --saveint = 5000
Argument: --entcoef = 0.01
Argument: --latentlossweight = 1
Argument: --generationlossweight = 1
Argument: --GANGlossweight = 1
Argument: --GANDlossweight = 1
Argument: --policylossweight = 1
Argument: --criticlossweight = 0.5
Argument: --alpha = 0.99
Argument: --epsilon = 1e-5
Argument: --maxgradnorm = 0.5
Argument: -lr, --learningrate = 7e-3
Argument: -ts, --trainsteps = 50000
Argument: --gamma = 0.99
Argument: --gaelambda = 0.96
Argument: --actprobability = 1
Argument: --explorprobability = 0.05
This is an overview of the main sources and ideas being used for code snippets, implementation, and inspiration of this model.
Publication - https://arxiv.org/abs/1602.01783 Example implementation in OpenAI's baselines library - https://github.com/openai/baselines
Publication - https://arxiv.org/abs/1611.04076v2 Blog post with explanation and implementation by Agustinus Kristiadi - https://wiseodd.github.io/techblog/2017/03/02/least-squares-gan/
Publication - https://arxiv.org/abs/1503.08895v4 Example implementation - https://github.com/carpedm20/MemN2N-tensorflow
Publication - https://arxiv.org/pdf/1512.09300.pdf Example implementation - https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN
Explanation and example implementation in tutorial for tensorflow - https://github.com/tensorflow/nmt