Dataset for couplets. A database of 700,000 couplets.
Latest commit 4639570 Feb 24, 2018
Type Name Latest commit message Commit time
.gitignore spider to fetch couplet data on sina blog Jul 20, 2017
LICENSE Create LICENSE Feb 24, 2018
README.md add data format Feb 24, 2018
sina_spider.py fix unicode for python 3 Jul 20, 2017

README.md

A dataset of couplets.

This project fetches couplets from the Sina blog 冯重朴_梨味斋散叶_的博客.

This dataset contains more than 700,000 couplets.

Run the spider:

scrapy runspider sina_spider.py

The spider stores the fetched data in ./output/.

Download the data

An already fetched and cleaned dataset is also available and can be used directly with the seq2seq model. You can download it here.

The downloaded data contains 5 files:

  1. train/in.txt: The inputs of the couplets. Each line is one input. Words are separated by spaces.
  2. train/out.txt: The outputs of the couplets. Each line is the output for the same line in in.txt. Words are separated by spaces.
  3. test/in.txt: Same format as train/in.txt, but with less data.
  4. test/out.txt: Same format as train/out.txt, but with less data.
  5. vocabs: The vocabulary file. <s> and <\s> are added as the first entries; they are used when training the seq2seq model.
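
The file layout above can be read with a few lines of Python. The following is only a minimal sketch, not part of this repository: the function names (read_tokenized, load_pairs, load_vocab) are hypothetical, and it assumes the files are UTF-8 encoded and that vocabs holds one entry per line.

```python
from pathlib import Path


def read_tokenized(path):
    """Return one list of tokens per line, splitting on whitespace."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        return [line.split() for line in f]


def load_pairs(data_dir, split="train"):
    """Pair line i of in.txt with line i of out.txt for the given split."""
    base = Path(data_dir) / split
    inputs = read_tokenized(base / "in.txt")
    outputs = read_tokenized(base / "out.txt")
    if len(inputs) != len(outputs):
        raise ValueError("in.txt and out.txt have different line counts")
    return list(zip(inputs, outputs))


def load_vocab(data_dir):
    """Map each vocabulary entry to its line index.

    Assumes one entry per line; blank lines are skipped.
    """
    with open(Path(data_dir) / "vocabs", encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f if line.strip()]
    return {token: index for index, token in enumerate(tokens)}
```

For example, if the download was unpacked into a directory named couplet (a placeholder name), load_pairs("couplet", "test") would return the test pairs and load_vocab("couplet") the token-to-index mapping. Checking that both files have the same number of lines catches a truncated download before training starts.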