Training and test data to accompany Moz's content extraction algorithms, Dragnet. For details about the algorithms and code, see the Dragnet homepage.
NOTE: While the Dragnet code and trained models are licensed under the MIT license, this data is licensed under the AGPLv3. This means, among other things, that any derived works from the data must also be open sourced, even if they are provided as a service. Our intention is to provide the data freely for research and non-commercial purposes, and to allow commercial use as long as the resulting work is open sourced.
The data was collected in 2012 by Kurtis Bohrnstedt.
git clone https://github.com/seomoz/dragnet_data.git
cd dragnet_data
tar xvf dragnet_HTML.tar.gz
tar xvf dragnet_Corrected.tar.gz
Details about the data
A training data set consists of a collection of web pages and the extracted
"gold standard" content. For our purposes we standardize
a data set as a set of files on disk with a specific directory and naming
convention. Each training example is all the data associated
with a single web page and all data for all examples lives under
ROOTDIR (typically the root of this repository).
Each training example is identified by a common file root.
The data for example
X lives in a set of sub-directories as follows:
$ROOTDIR/HTML/ contains the raw HTML, named with the example's file root.
$ROOTDIR/Corrected/ contains the extracted content, also named with the file root.
The "Corrected" files separate the main article from comments with the string
!@#$%^&*() COMMENTS on its own line. Any text appearing before this string
in the file is the main article; text after it belongs to the comments.
test.txt lists the files in the training and test splits.
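The article/comments convention above is easy to consume programmatically. Here is a minimal Python sketch (the function name is ours, not part of Dragnet) that splits the contents of a Corrected file on the delimiter string described above:

```python
# Delimiter that separates the main article text from reader comments
# in a Corrected file; it appears on its own line.
DELIMITER = "!@#$%^&*() COMMENTS"

def split_corrected(text):
    """Split the contents of a Corrected file into (article, comments).

    Text before the delimiter is the main article; text after it belongs
    to the comments. Files without the delimiter have no comments.
    """
    if DELIMITER in text:
        article, comments = text.split(DELIMITER, 1)
        return article.strip(), comments.strip()
    return text.strip(), ""
```

For example, `split_corrected("Main text.\n!@#$%^&*() COMMENTS\nGreat post!")` returns the pair `("Main text.", "Great post!")`.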
Additional data sources
Tim Weninger provides the data used in his paper
"CETR -- Content Extraction with Tag Ratios" (WWW 2010);
scroll to the bottom of his page for a link to the data.
We used the bash script
cetr_to_dragnet.sh to convert the data from CETR to Dragnet format. In using their data,
we had to remove a small number of documents (fewer than 15) since they were so malformed
that libxml2 could not parse them. We also found some systematic problems with the data in the
myriad data sets, so we decided not to use them. For example,
many of the HTML files in
cleaneval-zh contain several
</html> tags, followed immediately by
<DOCTYPE ..> tags that libxml2 bonks out on. Many of the gold standard files in the
myriad data contain significant portions of duplicated content that is not
present in the HTML document, which we cannot use without a lot of manual cleanup.
Creating your own training data
You can easily create your own training data:
- Create a directory hierarchy as detailed above (HTML and Corrected sub-directories under ROOTDIR).
- Save HTML files into the HTML directory to be used as training examples. This is the raw HTML from crawling the page or "Save as.." in a browser.
- Extract the content from the HTML files into the Corrected directory: cut and paste the main content into the Corrected text file. If there are any comments, separate the comments from the main article content in the text file with the string !@#$%^&*() COMMENTS on its own line.
- Give your data back to the research community so everyone can benefit :-)