This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.
This dataset contains more than 700,000 couplets.
Run the spider:
scrapy runspider sina_spider.py
It will store the data into
Download the data
A pre-fetched, cleaned dataset is available and can be used directly with the seq2seq model. You can download it here.
The downloaded data contains 5 files:
train/in.txt: The inputs of the couplets, one per line. Words are separated by spaces.
train/out.txt: The outputs of the couplets. Each line is the output for the same line in in.txt. Words are separated by spaces.
test/in.txt: Same as train/in.txt but with less data.
test/out.txt: Same as train/out.txt but with less data.
vocabs: Vocabulary file. <\s> is added as the first entry, which is used when training the seq2seq model.
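
The file layout above can be read with a few lines of Python. This is a minimal sketch, not part of the project: the function names parse_pairs, load_split and load_vocab are hypothetical, and it assumes the files are UTF-8 encoded and that the vocabs file holds one entry per line.

```python
from pathlib import Path


def parse_pairs(in_lines, out_lines):
    """Pair each line of in.txt with the same-numbered line of out.txt."""
    pairs = []
    for first, second in zip(in_lines, out_lines):
        # Words are separated by spaces, so str.split() recovers them
        # and also drops the trailing newline.
        pairs.append((first.split(), second.split()))
    return pairs


def load_split(data_dir, split):
    """Read <data_dir>/<split>/in.txt and out.txt; split is "train" or "test"."""
    base = Path(data_dir) / split
    # Assumption: the files are UTF-8 encoded.
    with open(base / "in.txt", encoding="utf-8") as f_in, \
            open(base / "out.txt", encoding="utf-8") as f_out:
        return parse_pairs(f_in, f_out)


def load_vocab(path):
    """Read the vocabs file, assuming one entry per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Illustrative lines only (not taken from the dataset).
example = parse_pairs(["春 风 \n"], ["秋 月 \n"])
print(example)
```

With the archive unpacked into a directory of your choice, load_split(that_directory, "train") would return a list of (input words, output words) tuples, one per couplet.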