
Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization


sarmilaupadhyaya/cnn-dailymail

 
 


This fork changes the preprocessed output to JSON format so that non-TensorFlow libraries can work with the CNN/Daily Mail summarization dataset.

Note: requires Python 3

This fork was primarily developed to work with this repository, which uses PyTorch.

--

1. Download data

Download and unzip the stories directories from here for both CNN and Daily Mail.

Warning: These files contain a few (114, in a dataset of over 300,000) examples for which the article text is missing - see for example cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story. The PyTorch code handles these fine, except in the extreme case where every example sampled in a batch is empty.
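If you want to see how many of the downloaded stories are affected, you can scan the stories directories and count files whose article text (everything before the first @highlight marker) is empty. The following is a minimal sketch assuming the standard .story layout of article text followed by @highlight blocks; the paths are placeholders.

import os
import sys

def article_is_empty(path):
    """Return True if the .story file has no article text before the first @highlight."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Everything before the first @highlight marker is the article body.
    return text.split("@highlight", 1)[0].strip() == ""

def count_empty_stories(stories_dir):
    names = [n for n in os.listdir(stories_dir) if n.endswith(".story")]
    return sum(article_is_empty(os.path.join(stories_dir, n)) for n in names)

if __name__ == "__main__":
    # Usage: python count_empty.py /path/to/cnn/stories /path/to/dailymail/stories
    for d in sys.argv[1:]:
        print(d, count_empty_stories(d))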

2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see something like:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
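The preprocessing script drives this tokenizer from Python; its exact batching differs, but the mechanics can be sketched with a subprocess call that pipes text through the same command used in the check above. This is only an illustration, not the invocation make_datafiles.py actually uses.

import subprocess

def tokenize_text(text):
    """Pipe text through the Stanford PTBTokenizer; CLASSPATH must point at the CoreNLP jar."""
    # Tokens come back one per line on stdout; the timing message goes to stderr.
    proc = subprocess.run(
        ["java", "edu.stanford.nlp.process.PTBTokenizer"],
        input=text, capture_output=True, text=True, check=True)
    return proc.stdout.split()

print(tokenize_text("Please tokenize this text."))
# Expected: ['Please', 'tokenize', 'this', 'text', '.']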

3. Process into JSON files (packed into tarballs) and vocab_cnt files (Python pickle)

Run

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.

This script will do several things:

  • The directories cnn_stories_tokenized and dm_stories_tokenized will be created and filled with tokenized versions of cnn/stories and dailymail/stories. This may take some time. Note: you may see several Untokenizable: warnings from Stanford Tokenizer. These seem to be related to Unicode characters in the data; so far it seems OK to ignore them.
  • For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to tarball files train.tar, val.tar and test.tar. These will be placed in the newly-created finished_files directory. This may take some time.
  • Additionally, a vocab_cnt.pkl file is created from the training data. This is also placed in finished_files. It is a Python Counter of all words in the training set, which can be used to build a vocabulary by word frequency (a sketch of loading these outputs follows this list).
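Because the outputs are plain tarballs of JSON files plus a pickled Counter, they can be read without TensorFlow. Below is a minimal sketch of loading them; the JSON field names and the vocabulary cutoff of 50,000 words are assumptions for illustration, so inspect one extracted file to confirm the actual schema.

import json
import pickle
import tarfile

# Peek at one JSON record inside train.tar.
with tarfile.open("finished_files/train.tar") as tar:
    member = next(m for m in tar if m.isfile() and m.name.endswith(".json"))
    record = json.load(tar.extractfile(member))
    print(list(record.keys()))  # confirm which fields hold the article and the summary

# Load the word-count Counter and keep the most frequent words as the vocabulary.
with open("finished_files/vocab_cnt.pkl", "rb") as f:
    word_counts = pickle.load(f)
vocab = [w for w, _ in word_counts.most_common(50000)]
print(len(vocab), vocab[:10])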
