
# Benchmark results on conversational response selection

This page contains benchmark results for the baselines and other methods on the datasets contained in this repository.

Please feel free to submit the results of your model as a pull request.

All results are for models that use only the context feature to select the correct response. Models that use the extra contexts are not reported here (yet).

For a description of the baseline systems, see `baselines/README.md`.
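All figures below are 1-of-100 accuracy: for each test example, the model scores the true response against 99 distractor responses drawn from the same batch of 100, and the example counts as correct only if the true response ranks first. The snippet below is a minimal sketch of the metric, assuming a dual-encoder-style model whose score is the dot product of context and response vectors; the function name and shapes are illustrative and not part of this repository.

```python
import numpy as np

def one_of_100_accuracy(context_vecs, response_vecs):
    """Fraction of contexts whose true response outscores the other 99.

    Both arrays have shape [100, dim]; row i of response_vecs is the
    true response for row i of context_vecs, and the remaining 99 rows
    act as distractors. Scores here are plain dot products.
    """
    scores = context_vecs @ response_vecs.T    # [100, 100] score matrix
    predicted = scores.argmax(axis=1)          # top-ranked response per context
    return float((predicted == np.arange(len(scores))).mean())

# Random vectors should score near the 1% chance level.
rng = np.random.default_rng(0)
print(one_of_100_accuracy(rng.normal(size=(100, 512)),
                          rng.normal(size=(100, 512))))
```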

## Reddit

These are results on the Reddit data from 2015 to 2018 inclusive (`TABLE_REGEX="^201[5678]_[01][0-9]$"`).
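As a quick sanity check of the date range that regular expression selects (a hypothetical snippet, assuming the monthly Reddit tables follow a `YYYY_MM` naming convention):

```python
import re

TABLE_REGEX = r"^201[5678]_[01][0-9]$"

# Only the 2015-2018 monthly tables match.
for name in ["2014_12", "2015_01", "2018_12", "2019_01"]:
    print(name, bool(re.match(TABLE_REGEX, name)))
# 2014_12 False / 2015_01 True / 2018_12 True / 2019_01 False
```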

| Method | 1-of-100 accuracy |
| --- | --- |
| *Baselines* | |
| TF_IDF | 26.4% |
| BM25 | 27.5% |
| USE_SIM | 36.6% |
| USE_MAP | 40.8% |
| USE_LARGE_SIM | 41.4% |
| USE_LARGE_MAP | 47.7% |
| ELMO_SIM | 12.5% |
| ELMO_MAP | 20.6% |
| *Other models* | |
| PolyAI-Encoder [1] | 61.3% |

## OpenSubtitles

| Method | 1-of-100 accuracy |
| --- | --- |
| *Baselines* | |
| TF_IDF | 10.9% |
| BM25 | 10.9% |
| USE_SIM | 13.6% |
| USE_MAP | 15.8% |
| USE_LARGE_SIM | 14.9% |
| USE_LARGE_MAP | 18.0% |
| ELMO_SIM | 9.5% |
| ELMO_MAP | 13.3% |
| *Other models* | |
| PolyAI-Encoder [1] | 30.6% |

## AmazonQA

| Method | 1-of-100 accuracy |
| --- | --- |
| *Baselines* | |
| TF_IDF | 51.8% |
| BM25 | 52.3% |
| USE_SIM | 47.6% |
| USE_MAP | 54.4% |
| USE_LARGE_SIM | 51.3% |
| USE_LARGE_MAP | 61.9% |
| ELMO_SIM | 16.0% |
| ELMO_MAP | 35.5% |
| *Other models* | |
| PolyAI-Encoder [1] | 84.2% |

## References

[1] Henderson et al. *A Repository of Conversational Datasets*. arXiv preprint, 2019.