Conversation Initiation Dataset

In everyday chit-chat, one participant acts as the conversation initiator, proactively casting an initial utterance to start chatting. However, most existing conversation systems cannot play this role. Previous studies on conversation systems assume that the user always initiates the conversation, and have focused on how to respond to a given user utterance. As a result, existing conversation systems are passive: they keep waiting until a user speaks to them. To tackle this problem, we created a large-scale dataset for training and evaluating conversation initiation models through a crowd-sourcing service. Here, we consider a task setting in which the system initiates a conversation by talking about a news topic. In this setting, the system is given a news post to talk about and uses it to generate the initial utterance of the conversation.

License

Creative Commons Attribution 4.0 License

Dataset Description

Input (news contents)

src_*.tsv: input files used as source sentences for training and testing an encoder-decoder based conversation model. These files contain tweet IDs of @YahooNewsTopics. You should replace each line with the original news post (you can retrieve the corresponding news post from https://twitter.com/YahooNewsTopics/status/XXXXXXXXX, where XXXXXXXXX is the tweet ID).
- You have to remove URLs and the first token surrounded by "【" and "】" from each tweet.
- If you have trouble collecting tweets, please contact the author.
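
The cleanup step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' preprocessing script: the function name `clean_news_post` and both regular expressions are assumptions, and the sketch assumes the bracketed token sits at the very start of the tweet and that URLs consist only of ASCII characters.

```python
import re

# Assumption: URLs in the tweets start with http:// or https:// and contain
# only printable ASCII, so adjacent Japanese text is never swallowed.
URL_PATTERN = re.compile(r"https?://[\x21-\x7e]+")

# Assumption: the token to drop is the first thing in the tweet,
# optionally preceded by whitespace, and enclosed in 【 and 】.
LEADING_TAG_PATTERN = re.compile(r"^\s*【[^】]*】")


def clean_news_post(tweet_text: str) -> str:
    """Remove the leading 【...】 token and all URLs from one tweet."""
    text = LEADING_TAG_PATTERN.sub("", tweet_text, count=1)
    text = URL_PATTERN.sub("", text)
    # Trim whitespace left behind at either end after the removals.
    return text.strip()
```

Only a bracketed token at the start of the tweet is removed; brackets that appear later in the text are left untouched. The cleaned text still needs to be tokenized before use.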

Output (summarization and chit-chat)

tgt-sep_*.tsv: output files used as target sentences for training and testing an encoder-decoder based conversation model. These files can be used for developing the Separate model (denoted as Separate(Gen) or Separate(Gen+MMI) in the NAACL paper). The first column is the summarization part and the second column is the chit-chat part.
tgt-joint_*.tsv: output files used as target sentences for training and testing an encoder-decoder based conversation model. These files can be used for developing the Joint model (denoted as Joint in the NAACL paper).
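
As a rough sketch of how the two-column tgt-sep_*.tsv layout might be read, the Python snippet below splits each line into its summarization and chit-chat parts. The names `SeparateTarget` and `parse_separate_targets` are hypothetical, and the sketch assumes tab-separated columns (inferred from the .tsv extension), exactly two columns per line, and UTF-8 encoding.

```python
from typing import Iterable, Iterator, NamedTuple


class SeparateTarget(NamedTuple):
    """One target row: summarization part, then chit-chat part."""
    summary: str
    chitchat: str


def parse_separate_targets(lines: Iterable[str]) -> Iterator[SeparateTarget]:
    """Yield (summary, chitchat) pairs from tab-separated lines."""
    for line_number, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        if not row:
            continue  # skip blank lines
        columns = row.split("\t")
        # Assumption: every non-empty line has exactly two columns.
        if len(columns) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(columns)}"
            )
        yield SeparateTarget(columns[0], columns[1])


# Usage (assuming the file is UTF-8 encoded):
# with open("tgt-sep_train.tsv", encoding="utf-8") as f:
#     for target in parse_separate_targets(f):
#         print(target.summary, "|", target.chitchat)
```

Raising on a malformed line, rather than skipping it silently, keeps the targets aligned line by line with the corresponding src_*.tsv file.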

Note that all the files must be tokenized using MeCab ver. 0.996 with the ipadic dictionary. You can easily tokenize them using the -O option (-O wakati produces space-separated tokens). For installation, this document might be useful.

Citation

@inproceedings{akasaki2018ci,
  title     = {Conversation Initiation by Diverse News Contents Introduction},
  author    = {Satoshi Akasaki and Nobuhiro Kaji},
  booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages     = {to appear},
  year      = {2019},
}

