DialogCorpus

A large-scale dialog corpus for training next-generation dialog systems.

How to Use?

First, clone the repository.

# download
git clone https://github.com/qywu/DialogCorpus.git
cd DialogCorpus

You can download and process each dataset manually. For example, for daily_dialog:

# download data for daily_dialog
python daily_dialog/download_data.py
# process the data
python daily_dialog/process_data.py
# the processed data is stored as {folder_name}/data/{folder_name}.json
vi daily_dialog/data/daily_dialog.json
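
To inspect the processed file from Python rather than an editor, a minimal sketch follows. It only assumes the output is valid JSON at the path shown above; the internal schema is not documented here, so the hypothetical helper `summarize_json` reports nothing beyond the top-level container type and its size.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level container.

    Hypothetical helper: it makes no assumption about the corpus schema.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Only lists and dicts have a meaningful size at the top level.
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    # Path taken from the manual steps above; run from the repository root.
    target = Path("daily_dialog/data/daily_dialog.json")
    if target.exists():
        print(summarize_json(target))
```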

Alternatively, download, process, and join all datasets with a single command.

python prepare_all_data.py \
       --download \
       --process \
       --join
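
If you would rather drive the per-dataset scripts yourself, the manual steps can be looped over several folders. The sketch below is not the implementation of `prepare_all_data.py`; it assumes every dataset folder contains `download_data.py` and `process_data.py` like `daily_dialog` does, and the helper names `build_commands` and `prepare` are hypothetical. It does not reproduce the `--join` step.

```python
import subprocess
import sys


def build_commands(folder):
    """Return the download and process commands for one dataset folder.

    Assumes the folder follows the daily_dialog layout shown earlier.
    """
    return [
        [sys.executable, f"{folder}/download_data.py"],
        [sys.executable, f"{folder}/process_data.py"],
    ]


def prepare(folders, run=subprocess.run):
    """Run both steps for each folder, stopping at the first failure."""
    for folder in folders:
        for command in build_commands(folder):
            # check=True raises CalledProcessError on a non-zero exit code.
            run(command, check=True)


# Example call, to be run from the repository root:
# prepare(["daily_dialog"])
```

Only `daily_dialog` is named as a folder in this README; check the repository for the other folder names before adding them to the list.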

Dataset-specific processing:

  • Daily Dialog

    • Removed tokenization spaces around punctuation
  • Persona Chat

    • Used Hugging Face's version [link]
    • Restored the casing of lowercased utterances
    • Removed tokenization spaces around punctuation
  • Cornell Movie Corpus

    • Ignored UTF-8 decoding errors
    • Extracted names
  • Task Master

    • Nothing specific
  • CCPE

    • Nothing specific
  • Frames

    • Nothing specific
  • Chit-Chat Challenge

    • Nothing specific
  • Self-dialogue

    • Nothing specific
  • Schema Dialog

    • Nothing specific
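
Two of the steps listed above, removing tokenization spaces and ignoring UTF-8 errors, can be illustrated with a short sketch. The exact rules used by the repository's processing scripts are not shown in this README, so the regular expressions and the function names `detokenize_punctuation` and `read_lines_ignoring_utf8_errors` are assumptions for illustration only.

```python
import re

# Whitespace before common punctuation, e.g. "hello , world !"
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.!?;:])")
# Whitespace that splits English contractions, e.g. "do n't", "it 's"
_SPLIT_CONTRACTION = re.compile(r"\s+(n't|'s|'re|'ve|'m|'ll|'d)\b")


def detokenize_punctuation(text):
    """Remove the spaces a tokenizer inserted before punctuation."""
    text = _SPLIT_CONTRACTION.sub(r"\1", text)
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


def read_lines_ignoring_utf8_errors(path):
    """Read a text file, silently dropping bytes that are not valid UTF-8."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return [line.rstrip("\n") for line in f]
```

For example, `detokenize_punctuation("i do n't know .")` returns `"i don't know."`. Note that `errors="ignore"` discards undecodable bytes, so affected characters are lost rather than repaired.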

Links
