An old research proposal: Comprehensive Conditioning of Neural Conversational Models

Abstract

Neural network based approaches to conversational modeling have been prevalent over the last three years. While a multitude of techniques have been proposed to improve the performance of dialog agents, open-domain chatbots still tend to produce generic and safe responses without much diversity [Li et al., 2015, Vinyals and Le, 2015]. This is caused by the learning target not being well-defined for training conversational models. In my work I plan to explore ideas meant to address this issue, namely building dialog models that are conditioned on more prior information, such as persona, mood, world-knowledge, and outside factors. This should, in theory, ensure that the models do not simply average out ambiguities in the dataset, so that a more natural and diverse chatbot can be created.

Introduction

A conversational agent is a piece of software that can communicate with humans using natural language. Modeling conversation is an important task in natural language processing and artificial intelligence. Open-domain chatbots are useful for augmenting task-oriented dialog agents to handle out-of-domain utterances [Yu et al., 2017, Zhao et al., 2017, Akasaki and Kaji, 2017], and they represent an important step towards artificial general intelligence. Such chatbots are used in a plethora of real-world applications, for example Apple’s Siri and Amazon’s Alexa.

Background

Neural network based conversational models have dominated the field ever since the introduction of the seq2seq model [Cho et al., 2014, Vinyals and Le, 2015], and a multitude of augmentations and extensions have been proposed to increase the performance of seq2seq based dialog agents. The hierarchical recurrent encoder-decoder model [Serban et al., 2016, Serban et al., 2017] makes it possible to take previous dialog turns into account effectively, and the incorporation of attention mechanisms [Bahdanau et al., 2014] has also been explored [Xing et al., 2017b]. Reinforcement learning and adversarial approaches to conversational modeling have been analyzed as well [Li et al., 2016c, Li et al., 2016b, Kandasamy et al., 2017, Li et al., 2017].

An issue with current chatbot models is that, through the standard loss function used to train them, they learn to maximize the probability of predicting a ground-truth utterance given an input utterance. This is problematic because, for a question like How are you?, many equally good answers exist. This ambiguity makes neural conversational models (NCMs) learn generic and safe responses [Vinyals and Le, 2015, Li et al., 2015, Serban et al., 2016]. In order to combat this and produce more diverse outputs, other loss functions [Li et al., 2015] and various additional features have been proposed [Li et al., 2016a, Xing et al., 2017a].
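
To make the objective concrete: a seq2seq NCM with parameters θ is usually trained by maximum likelihood, i.e. by minimizing the per-token negative log-likelihood of the ground-truth response y = (y_1, ..., y_T) given the source utterance x. The notation below is a generic sketch of this standard objective, not a formula taken from any of the cited papers:

```latex
\mathcal{L}_{\mathrm{MLE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```

When several distinct ground-truth responses appear for the same x in the training data, this objective is minimized by spreading probability mass across all of them, which is one way to state the averaging problem discussed in the next section.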

Hypothesis and Objectives

In my work I plan to explore this loss function problem and experiment with possible solutions; the goal of the project is to build more natural and diverse open-domain dialog agents. My presumption is that NCMs output safe and generic responses because the loss function forces the model to average out ambiguities when multiple ground-truth responses are paired with the same input utterance in the dataset. In the embedding space, the average of these diverse responses should point toward the region where generic and safe replies are located, since such replies have been seen by the model as ground-truth outputs for many source utterances and carry little information. A first step in my thesis work would be to test this hypothesis. Furthermore, I argue that in order to differentiate between different replies to the same source utterance, and so address the loss function issue, several priors should be used to condition neural models.
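
A minimal sketch of how this averaging hypothesis could be tested, assuming some sentence-embedding function embed (e.g. averaged word vectors) and a small list of known generic replies; every name below is illustrative and not part of any existing codebase:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def averaging_test(embed, responses_for_source, generic_replies):
    """Compare how close the *mean* embedding of the diverse ground-truth
    responses to one source utterance is to known generic replies, versus
    how close the individual responses are."""
    response_vecs = np.stack([embed(r) for r in responses_for_source])
    mean_vec = response_vecs.mean(axis=0)

    # Similarity of the averaged response vector to the closest generic reply.
    mean_to_generic = max(cosine(mean_vec, embed(g)) for g in generic_replies)

    # The same measure for each individual ground-truth response.
    individual_to_generic = [
        max(cosine(v, embed(g)) for g in generic_replies) for v in response_vecs
    ]
    return mean_to_generic, float(np.mean(individual_to_generic))
```

If, over many source utterances, the first returned value is consistently larger than the second, that would support the presumption that averaging drives outputs toward the generic region of the embedding space.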

Methodology

After testing my hypothesis related to the loss function, the next step, in the first few months of the project, would be to condition the generation of responses on the speaker and their mood, in addition to taking previous dialog turns into account. Since one’s conversational style depends heavily on personality and mood, these factors play a large role in disambiguating replies.

Moreover, in my first year I also plan to experiment with unsupervised techniques used for language modeling in order to build world-, language-, and conversation-knowledge representations that can serve as further input to NCMs. An example of world-knowledge is the color of the sky; language-knowledge concerns learning the meaning of words; and conversation-knowledge concerns, for example, learning to answer yes-or-no questions with yes or no.

In my opinion there is one final prior that needs to be taken into account to truly capture the conditioning space of conversations, which I would experiment with in my second year. The reply I got hit by a car. to the question How are you?, for example, has little to do with persona or mood; rather, it is conditioned on outside factors. This prior information could be incorporated into NCMs by slightly changing the speaker’s representation for this specific response, or by introducing a new representation that tries to encode outside factors.

Lastly, in the second half of the second year, I plan to make dialog agents more human-like by adding temporal conditioning and real-time model updates. Temporal conditioning is an additional term in the loss function that takes into account the elapsed time since the last utterance in a conversation, and real-time model updates are backpropagation steps carried out during live conversation, so that the dialog agent can memorize information about its users.

I plan to test the addition of these priors and features with several architectures used for building dialog agents, and, if time permits, to experiment with constructing my own neural network based models. Neural architecture search [Zoph and Le, 2016], for example, is a promising line of research for finding better neural models for a specific task, but it has not yet been applied to conversational modeling.
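
As a rough illustration of the kind of conditioning described above, one simple design is to learn embeddings for the speaker, their mood, and an "outside factors" slot, and to concatenate them with the word embedding at every decoder step. The following PyTorch-style sketch shows one such conditioned decoder; it is a hypothetical illustration under these assumptions, not the proposed model, and the temporal term is only indicated in a comment:

```python
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    """GRU decoder whose input at each step is the previous word embedding
    concatenated with persona, mood and outside-factor embeddings."""

    def __init__(self, vocab_size, n_speakers, n_moods,
                 word_dim=256, cond_dim=64, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.speaker_emb = nn.Embedding(n_speakers, cond_dim)
        self.mood_emb = nn.Embedding(n_moods, cond_dim)
        # Free-form vector for outside factors (e.g. "I got hit by a car.").
        self.outside_proj = nn.Linear(cond_dim, cond_dim)
        self.gru = nn.GRU(word_dim + 3 * cond_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, speaker_id, mood_id, outside_vec, enc_state):
        # prev_tokens: (batch, seq_len) shifted ground-truth response tokens
        # enc_state:   (1, batch, hidden_dim) final state of some encoder
        batch, seq_len = prev_tokens.shape
        words = self.word_emb(prev_tokens)                        # (B, T, word_dim)
        cond = torch.cat([self.speaker_emb(speaker_id),
                          self.mood_emb(mood_id),
                          self.outside_proj(outside_vec)], dim=-1)
        cond = cond.unsqueeze(1).expand(batch, seq_len, -1)       # repeat per step
        hidden, _ = self.gru(torch.cat([words, cond], dim=-1), enc_state)
        return self.out(hidden)                                   # (B, T, vocab_size)

# Training would use the usual cross-entropy loss on these logits; the temporal
# conditioning mentioned above would add an extra (hypothetical) term depending
# on the elapsed time since the previous utterance, and real-time model updates
# would simply be further backpropagation steps taken during live conversations.
```

Whether the conditioning vectors are concatenated at the input, added to the decoder state, or attended over is an open design choice that the experiments would have to settle.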

Previous Work

In my previous work [Csaky, 2017] I explored in more detail the reasons behind the problems of the standard loss function and ideas towards solving them. I also conducted an in-depth literature survey, reading and analyzing over 70 publications in the field. Moreover, I experimented with adapting the Transformer model [Vaswani et al., 2017] to the conversational domain, and I plan to test the additional priors on this model as well.

Conclusion

The aim of my project is to tackle the loss function issue of current chatbot models, namely that the standard objective simply maximizes the probability of a single output sequence given an input. In order to combat this issue I propose a joint approach in which NCMs are conditioned on several priors and features. In addition, I plan to make chatbots more natural by implementing temporal conditioning and real-time model updates. Lastly, I would like to test these ideas on various models, such as the basic seq2seq and the Transformer model.

References

[Akasaki and Kaji, 2017] Akasaki, S. and Kaji, N. (2017). Chat detection in an intelligent assistant: Combining task-oriented and non-task-oriented spoken dialogue systems. arXiv preprint arXiv:1705.00746.
[Bahdanau et al., 2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Cho et al., 2014] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[Csaky, 2017] Csaky, R. K. (2017). Deep learning based chatbot models. https://tdk.bme.hu/VIK/DownloadPaper/asdad. Budapest University of Technology and Economics, Scientific Students’ Associations Report.
[Kandasamy et al., 2017] Kandasamy, K., Bachrach, Y., Tomioka, R., Tarlow, D., and Carter, D. (2017). Batch policy gradient methods for improving neural conversation models. arXiv preprint arXiv:1702.03334.
[Li et al., 2015] Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
[Li et al., 2016a] Li, J., Galley, M., Brockett, C., Spithourakis, G. P., Gao, J., and Dolan, B. (2016a). A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
[Li et al., 2016b] Li, J., Miller, A. H., Chopra, S., Ranzato, M., and Weston, J. (2016b). Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823.
[Li et al., 2016c] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., and Jurafsky, D. (2016c). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
[Li et al., 2017] Li, J., Monroe, W., Shi, T., Ritter, A., and Jurafsky, D. (2017). Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
[Serban et al., 2016] Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C., and Pineau, J. (2016). Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3784.
[Serban et al., 2017] Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A. C., and Bengio, Y. (2017). A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.
[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
[Vinyals and Le, 2015] Vinyals, O. and Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
[Xing et al., 2017a] Xing, C., Wu, W., Wu, Y., Liu, J., Huang, Y., Zhou, M., and Ma, W.-Y. (2017a). Topic aware neural response generation. In AAAI, pages 3351–3357.
[Xing et al., 2017b] Xing, C., Wu, W., Wu, Y., Zhou, M., Huang, Y., and Ma, W.-Y. (2017b). Hierarchical recurrent attention network for response generation. arXiv preprint arXiv:1701.07149.
[Yu et al., 2017] Yu, Z., Black, A. W., and Rudnicky, A. I. (2017). Learning conversational systems that interleave task and non-task content. arXiv preprint arXiv:1703.00099.
[Zhao et al., 2017] Zhao, T., Lu, A., Lee, K., and Eskenazi, M. (2017). Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. arXiv preprint arXiv:1706.08476.
[Zoph and Le, 2016] Zoph, B. and Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.