All the details are in our paper.
- Training data set
- Test data set
- Agreement data set annotated by B
- A few other pages
- Raw strings (because of copyright restrictions)
- Pages with known problems (as far as we have found), e.g.:
  - Pages that can currently not be downloaded from the Internet Archive
  - Pages that load their latest content from outside the archive
- download.rb downloads the pages and generates raw strings and HTML files containing only the content bodies.
- Usage:
$ ruby download.rb <path_to_PhantomJS_binary> ./data-set ./html-dir
- It was developed using:
- CentOS release 6.5
- Ruby 2.1.2p95
- PhantomJS 2.0.1-development
- The mandatory attribute value of a range is false if and only if the range is a transition.
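
The rule on the mandatory attribute can be illustrated with a minimal Ruby sketch. The `AnnotatedRange` struct, its `label` and `transition` fields, and the helper names are hypothetical and chosen only for illustration; they are not the actual format of the data set, which is described in the paper.

```ruby
# Hypothetical record for an annotated range. The field names are
# assumptions made for this sketch, not the data set's real schema.
AnnotatedRange = Struct.new(:label, :transition)

# The rule: mandatory is false exactly when the range is a transition,
# so the expected value is the negation of the transition flag.
def expected_mandatory(range)
  !range.transition
end

# Checks whether a stored mandatory value agrees with the rule.
def consistent?(range, stored_mandatory)
  stored_mandatory == expected_mandatory(range)
end

content = AnnotatedRange.new("content", false)
gap     = AnnotatedRange.new("gap", true)

puts expected_mandatory(content) # prints "true"
puts expected_mandatory(gap)     # prints "false"
puts consistent?(gap, true)      # prints "false": a transition marked mandatory breaks the rule
```

A check like `consistent?` could be run over parsed ranges to catch annotations that violate the rule, assuming the ranges expose something equivalent to a transition flag.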