https://github.com/thammegowda/virtchar
This repository contains code for building chat bots by training them on TV show transcripts.
See requirements.txt for the required libraries and versions.
There are two kinds of models:
- Retrieval based models: See docs/retrieval-bot.md
- Generator models: See docs/neural-generator-bot.md
A retrieval bot uses the InferSent model to understand sentences, i.e., to obtain sentence representations.
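In essence, every transcript context is embedded once with InferSent, and at chat time the bot replies with the response attached to the most similar stored context. A minimal sketch of that lookup (the function and variable names are illustrative, not this repository's API):

```python
import numpy as np

def retrieve_response(query_vec: np.ndarray,
                      context_vecs: np.ndarray,
                      responses: list) -> str:
    """Reply with the response whose stored context embedding is
    closest (by cosine similarity) to the incoming message.

    query_vec    : sentence vector of the user's message (e.g. from InferSent)
    context_vecs : [n, d] matrix of sentence vectors for known dialog contexts
    responses    : the reply recorded for each context, in the same order
    """
    sims = context_vecs @ query_vec
    sims /= (np.linalg.norm(context_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return responses[int(np.argmax(sims))]
```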
A generator bot offers two choices: a Hierarchical Transformer and a Hierarchical RNN. If you do not want hierarchical models, head over to Tensor2Tensor, OpenNMT-py, or RTG (if it is public or you have been given access); stick to this repository for the hierarchical models.
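The "hierarchical" part means a word-level encoder turns each utterance into a vector, and a dialog-level encoder runs over the sequence of utterance vectors. A toy PyTorch sketch of the idea, with layer names and sizes that are illustrative rather than this repository's implementation:

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word-level GRU encodes each utterance into a vector; a dialog-level
    GRU encodes the sequence of utterance vectors into per-turn context."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dialog_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, dialog):
        # dialog: LongTensor [n_utterances, max_words] of token ids (one conversation)
        embs = self.emb(dialog)              # [n_utt, max_words, emb_dim]
        _, utt_vecs = self.word_rnn(embs)    # [1, n_utt, hid_dim]: last hidden state per utterance
        ctx, _ = self.dialog_rnn(utt_vecs)   # [1, n_utt, hid_dim]: context vector per turn
        return ctx

# Usage: encode a 3-turn dialog with 12 tokens per utterance
# enc = HierarchicalEncoder(vocab_size=8000)
# ctx = enc(torch.randint(0, 8000, (3, 12)))   # -> shape [1, 3, 256]
```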
The retrieval-based bot is easy to get working and fun, so you should start there (see docs/retrieval-bot.md).
Hierarchical NLU-based generator models have many issues and require a lot of effort to get working.
Specifically, you will hit these issues:
- They need lots of data and time to train.
- They (along with their non-hierarchical counterparts) tend to produce short, generic answers such as "i dont know", "yes", and "no".
For the first problem, pre-train on a huge out-of-domain corpus and fine-tune on the desired corpus (see finetune_dialogs in the config files and the --fine-tune option of the trainer to make the switch).
Train these models on a GPU to speed things up (PyTorch is under the hood).
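Conceptually, the fine-tune switch only changes which corpus feeds the trainer. A rough sketch of the two-phase schedule, in which train_step and the argument names are hypothetical rather than this repository's trainer API:

```python
def train(model, pretrain_batches, finetune_batches, fine_tune=False):
    """Two-phase schedule: pre-train on the big out-of-domain corpus,
    then flip the switch and continue on the small in-domain dialogs.
    `train_step` and the argument names are illustrative only."""
    batches = finetune_batches if fine_tune else pretrain_batches
    for batch in batches:
        model.train_step(batch)  # one parameter update per batch
```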
The second problem is harder and more interesting in its own right: it is where the MLE assumption breaks down. See this paper to know why it is hard. For now, this project uses a simple technique of down-sampling utterances (enabled by default).
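One common way to implement such down-sampling is to cap how many times any single (normalized) utterance may appear in the training data, which mostly hits the generic replies. A minimal sketch of that idea, not the project's exact rule:

```python
import random
from collections import Counter

def down_sample(utterances, max_copies=100, seed=42):
    """Keep at most `max_copies` occurrences of each distinct utterance,
    so frequent generic replies ("i dont know", "yes", ...) stop dominating.
    The cap and the normalization used here are illustrative only."""
    random.seed(seed)
    shuffled = list(utterances)
    random.shuffle(shuffled)          # avoid keeping only the earliest copies
    counts = Counter()
    kept = []
    for utt in shuffled:
        key = utt.lower().strip()
        if counts[key] < max_copies:
            counts[key] += 1
            kept.append(utt)
    return kept
```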
- Short enquiries: send them to me
- Long discussions and bugs: Create an issue or pull request on this repo