Evaluate your dialog model with 17 metrics! (see paper)

dialog-eval · twitter

A lightweight repo for automatic evaluation of dialog models using 17 metrics.


🔀   Choose which metrics you want to be computed
🚀    Evaluation can automatically run either on a response file or a directory containing multiple files
💾   Metrics are saved in a pre-defined easy to process format
⚠️   The program warns you if some files required to compute specific metrics are missing


  • Response length: Number of words in the response.
  • Per-word entropy: Probabilities of words are calculated based on frequencies observed in the training data. Entropy at the bigram level is also computed.
  • Utterance entropy: The product of per-word entropy and the response length. Also computed at the bigram level.
  • KL divergence: Measures how well the word distribution of the model responses approximates the ground truth distribution. Also computed at the bigram level (with bigram distributions).
  • Embedding: Embedding average, extrema, and greedy are measured. average measures the cosine similarity between the averages of the word vectors of the response and target utterances. extrema constructs a representation for each utterance by taking, for each dimension, the greatest absolute value among its word vectors, and measures the cosine similarity between the response and target representations. greedy matches each response token to a target token (and vice versa) based on the cosine similarity between their embeddings and averages the total score across all words.
  • Coherence: Cosine similarity of input and response representations (constructed with the average word embedding method).
  • Distinct: Distinct-1 and distinct-2 measure the ratio of unique unigrams/bigrams to the total number of unigrams/bigrams in a set of responses.
  • BLEU: Measures n-gram overlap between response and target (n = [1,2,3,4]). The smoothing method can be chosen in the arguments.
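
A few of the metrics above can be sketched in plain Python. This is a minimal illustration and not the repository's implementation: the function names are hypothetical, tokenization is a plain whitespace split, entropy uses log base 2, words unseen in the training data are skipped, and extrema keeps the sign of the selected value. All of these are assumptions.

```python
import math
from collections import Counter


def distinct_n(responses, n):
    """Ratio of unique n-grams to all n-grams in a set of responses."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def unigram_probs(training_utterances):
    """Word probabilities estimated from training-data frequencies."""
    counts = Counter(w for u in training_utterances for w in u.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def utterance_entropy(response, probs):
    """Sum of -log2 p(w) over the words of a response.

    Assumption: words missing from the training data are skipped.
    """
    return -sum(math.log2(probs[w]) for w in response.split() if w in probs)


def per_word_entropy(response, probs):
    """Utterance entropy divided by the response length."""
    length = len(response.split())
    return utterance_entropy(response, probs) / length if length else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_vector(vectors):
    """Dimension-wise mean of the word vectors of one utterance."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def extrema_vector(vectors):
    """Per dimension, the value with the greatest absolute magnitude."""
    return [max(col, key=abs) for col in zip(*vectors)]
```

With word vectors for a response and a target, the embedding average score would then be `cosine(average_vector(resp_vecs), average_vector(target_vecs))`, and the extrema score the same call with `extrema_vector`.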


Run this command to install required packages:

pip install -r requirements.txt


The main file can be called from anywhere, but when specifying paths to directories you should give them from the root of the repository.

python code/ -h

For the complete documentation visit the wiki.

Input format

You should provide as many of the required argument paths (image above) as possible. If some are missing, the program will still run, but it will skip the metrics that require those files and print which metrics were skipped. If you have a training data file, the program can automatically generate a vocabulary and download fastText embeddings.

If you don't want to compute all the metrics you can set which metrics should be computed in the config file very easily.

Saving format

A file will be saved to the directory containing the response file(s). The first row contains the names of the metrics, and each following row contains the metrics for one file: the name of the file followed by the individual metric values, separated by spaces. Each metric consists of three numbers separated by commas: the mean, standard deviation, and confidence interval. You can set the t value of the confidence interval in the arguments; the default corresponds to 95% confidence.
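
The row format described above can be illustrated with a short sketch. The helper names are hypothetical, and several details are assumptions rather than the repository's behavior: the confidence interval is taken to be the half-width `t * std / sqrt(n)` with `t = 1.96` for 95%, the standard deviation is the population estimate, and file names are assumed to contain no spaces.

```python
import math
import statistics


def summarize(values, t=1.96):
    """Return (mean, std, confidence interval half-width) for one metric."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    ci = t * std / math.sqrt(len(values))
    return mean, std, ci


def format_row(filename, summaries):
    """File name, then one 'mean,std,ci' cell per metric, separated by spaces."""
    cells = (",".join(f"{x:.4f}" for x in summary) for summary in summaries)
    return filename + " " + " ".join(cells)


def parse_row(row):
    """Split a saved row back into the file name and (mean, std, ci) tuples."""
    name, *cells = row.split()
    return name, [tuple(float(x) for x in cell.split(",")) for cell in cells]
```

Reading a results file back then amounts to skipping the header row and calling `parse_row` on each remaining line.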

Results & Examples

Transformer trained on DailyDialog

Interestingly, all 17 metrics improve until a certain point and then stagnate, with no overfitting occurring during the training of a Transformer model on DailyDialog. Check the appendix of the paper for figures.

TRF is the Transformer model evaluated at the validation loss minimum and TRF-O is the Transformer model evaluated after 150 epochs of training, where the metrics start stagnating. RT means randomly selected responses from the training set and GT means ground truth responses.

Transformer trained on Cornell

TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. These results are measured on the test set at a checkpoint where the validation loss was minimal.

Transformer trained on Twitter

TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. These results are measured on the test set at a checkpoint where the validation loss was minimal.


Check the issues for some additions where help is appreciated. Any contributions are welcome ❤️
Please try to follow the code syntax style used in the repo (flake8, 2 spaces indent, 80 char lines, commenting a lot, etc.)

New metrics can be added by making a class for the metric, which handles the computation of the metric given data. Check BLEU metrics for an example. Normally the init function handles any data setup which is needed later, and the update_metrics updates the metrics dict using the current example from the arguments. Inside the class you should define the self.metrics dict, which stores lists of metric values for a given test file. The names of these metrics (keys of the dictionary) should also be added in the config file to self.metrics. Finally you need to add an instance of your metric class to self.objects. Here at initialization you can make use of paths to data files if your metric requires any setup. After this your metric should be automatically computed and saved.

However, you should also add some constraints to your metric, e.g. if a file required for the computation of the metric is missing, the user should be notified.
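
As a rough sketch of the pattern described above, a new metric class might look like the following. The class name, the metric name, the arguments of `update_metrics`, and the `missing_files` helper are all hypothetical; check the BLEU metrics in the repo for the actual signatures before copying this.

```python
import os


class LengthRatioMetrics:
    """Hypothetical metric: response length divided by target length."""

    def __init__(self):
        # Any data setup would go here. The dict stores a list of values
        # per metric name, one value per example of a given test file.
        self.metrics = {"length ratio": []}

    def update_metrics(self, resp_words, gt_words):
        # Assumed arguments: the tokenized response and ground truth target.
        ratio = len(resp_words) / len(gt_words) if gt_words else 0.0
        self.metrics["length ratio"].append(ratio)


def missing_files(required_paths):
    """Return the required paths that do not exist, so the user can be warned."""
    return [path for path in required_paths if not os.path.exists(path)]
```

The key `"length ratio"` would also have to be listed in the config file, and an instance of the class added to `self.objects`, as explained above.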



This project is licensed under the MIT License - see the LICENSE file for details.
Please include a link to this repo if you use it in your work and consider citing the following paper:

    title = "Improving Neural Conversational Models with Entropy-Based Data Filtering",
    author = "Cs{\'a}ky, Rich{\'a}rd and Purgai, Patrik and Recski, G{\'a}bor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "5650--5669",