This repository has been archived by the owner on Jan 12, 2019. It is now read-only.

tmanabe/HEPS-data-set


HEPS data set

All the details are in our paper.

It contains

  • Training data set
  • Test data set
  • Agreement data set annotated by B
  • A few other pages

It does not contain

  • Raw strings (omitted for copyright reasons)
  • Pages with known problems, e.g.,
    • Pages currently not downloadable from Internet Archive
    • Pages that download the latest contents from outside the archive

download.rb

  • Downloads pages and generates raw strings and HTML files that contain only content bodies.
  • Usage:
$ ruby download.rb <path_to_PhantomJS_binary> ./data-set ./html-dir
  • It was developed using:
    • CentOS release 6.5
    • Ruby 2.1.2p95
    • PhantomJS 2.0.1-development
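
The usage line above takes three positional arguments, so a small wrapper can catch a wrong path before the download starts. The following is a minimal sketch, not part of this repository: the helper names `build_download_command` and `missing_inputs` are hypothetical, and it assumes only what the usage line shows (PhantomJS binary, data set directory, HTML output directory, in that order). Whether the output directory must already exist is not stated here, so the sketch does not check it.

```ruby
# Hypothetical helpers (not part of download.rb) that mirror the documented usage:
#   ruby download.rb <path_to_PhantomJS_binary> ./data-set ./html-dir

# Builds the argument vector for the documented command line.
def build_download_command(phantomjs_path, data_set_dir, html_dir, script = 'download.rb')
  ['ruby', script, phantomjs_path, data_set_dir, html_dir]
end

# Returns a list of human-readable problems with the inputs (empty if none found).
def missing_inputs(phantomjs_path, data_set_dir)
  problems = []
  unless File.file?(phantomjs_path) && File.executable?(phantomjs_path)
    problems << "PhantomJS binary not found or not executable: #{phantomjs_path}"
  end
  unless File.directory?(data_set_dir)
    problems << "data set directory not found: #{data_set_dir}"
  end
  problems
end

# Example (paths are placeholders):
#   problems = missing_inputs('/path/to/phantomjs', './data-set')
#   if problems.empty?
#     system(*build_download_command('/path/to/phantomjs', './data-set', './html-dir'))
#   else
#     problems.each { |p| warn p }
#   end
```

Passing the command as an array to `system` avoids shell quoting issues when a path contains spaces.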

Note

  • The mandatory attribute of a range is false if and only if the range is a transition.
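
The rule in the note can be used to separate transitions from other ranges. The sketch below only illustrates that rule: the `AnnotatedRange` struct, its `from` and `to` fields, and the sample offsets are assumptions made for the example, since the data set's file format is not described in this README.

```ruby
# Hypothetical in-memory representation of an annotated range.
# Only the "mandatory" attribute comes from the note above; the rest is illustrative.
AnnotatedRange = Struct.new(:from, :to, :mandatory)

# A range is a transition if and only if its mandatory attribute is false.
def transition?(range)
  range.mandatory == false
end

# Made-up sample ranges, for illustration only.
ranges = [
  AnnotatedRange.new(0, 12, true),
  AnnotatedRange.new(12, 15, false),
  AnnotatedRange.new(15, 40, true)
]

# Split the ranges into transitions and non-transitions.
transitions, non_transitions = ranges.partition { |r| transition?(r) }
```

If the attribute is stored as a string in the actual files, it would need to be converted to a boolean before this comparison.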

Link

About

a data set for heading-based page segmentation
