This repository has been archived by the owner. It is now read-only.
Extract SGML documents into plain text files
README
extractor.rb
small

README

Documents must be wrapped in a single root element,
<CORPUS>
...
</CORPUS>

so that Nokogiri can parse the file as XML.
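
The wrapping step can be sketched as below. This is only an illustration, not code from extractor.rb: the helper name wrap_in_corpus and the DOC/TEXT tags in the sample input are assumptions made for the example. The Nokogiri check at the end runs only if the gem is installed.

```ruby
# Hypothetical helper (not part of extractor.rb): wrap raw SGML text in a
# single <CORPUS> root element unless it is already wrapped.
def wrap_in_corpus(text)
  body = text.strip
  return body if body.start_with?("<CORPUS>") && body.end_with?("</CORPUS>")
  "<CORPUS>\n#{body}\n</CORPUS>"
end

# Illustrative input: two sibling documents with no common root.
# The DOC and TEXT tag names are assumed for this sketch.
raw = "<DOC><TEXT>first</TEXT></DOC>\n<DOC><TEXT>second</TEXT></DOC>"
wrapped = wrap_in_corpus(raw)
puts wrapped

# Optional check, run only when the nokogiri gem is available.
begin
  require "nokogiri"
  doc = Nokogiri::XML(wrapped)
  puts doc.root.name             # name of the root element
  puts doc.xpath("//DOC").size   # number of DOC elements under the root
rescue LoadError
  warn "nokogiri not installed; skipping the parse check"
end
```

Without the wrapper, the sample input has two top-level elements, which is not well-formed XML; adding one enclosing element gives the parser a single root to work from.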