
arxivsync 0.2.0

Ruby OAI interface for harvesting the arXiv. Can be used to store and update an XML mirror of paper metadata, and parse the XML into Ruby objects to allow conversion into a friendlier format.

Installation

  gem install arxivsync

Usage

Creating or updating an archive

Use the included shell command:

  arxivsync ARCHIVE_DIR

This stores each XML response as a separate file containing up to 1000 records. After the initial harvest, rerun the command to add new files containing all records since the last harvest.

Remember to leave at least a day between syncs: the temporal granularity doesn't go any smaller than that!
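As a rough illustration of the one-day rule, here is a minimal sketch of a guard you could run before invoking the sync. The helpers `last_harvest_date` and `sync_due?` are hypothetical and not part of arxivsync; the sketch assumes that the modification time of the newest file in the archive directory is a fair proxy for the date of the last harvest.

```ruby
require "date"

# Hypothetical helper, not part of arxivsync: date of the most recently
# modified file in the archive directory, or nil if it has no files yet.
def last_harvest_date(archive_dir)
  newest = Dir.glob(File.join(archive_dir, "*"))
              .select { |path| File.file?(path) }
              .map { |path| File.mtime(path) }
              .max
  newest && newest.to_date
end

# Hypothetical helper: true when the archive is empty or at least one
# calendar day has passed since the newest file was written.
def sync_due?(archive_dir, today = Date.today)
  last = last_harvest_date(archive_dir)
  last.nil? || (today - last) >= 1
end
```

A wrapper script could call `sync_due?("/home/foo/savedir")` and only run `arxivsync` when it returns true.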

Reading from an archive

  archive = ArxivSync::XMLArchive.new("/home/foo/savedir")
  archive.read_metadata do |papers|
    # Papers come in blocks of at most 1000 at a time
    papers.each do |paper|
      # Do stuff with paper
    end
  end

This parses the XML files using a SAX parser and yields Structs representing the metadata as it goes. The structures returned closely match the arxivRaw format.
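Since the yielded objects are Structs, converting them into a friendlier format can be done with plain Ruby. The following is a minimal sketch, not arxivsync's own code: `Paper` and `Version` are hypothetical stand-ins with made-up field names (the real Structs may differ), and `to_plain` is an illustrative helper that recursively turns Structs into hashes so they can be serialized as JSON.

```ruby
require "json"

# Hypothetical stand-ins for the Structs arxivsync yields; the real
# field names may differ.
Version = Struct.new(:version, :date)
Paper = Struct.new(:id, :title, :versions)

# Recursively convert Structs (and arrays of them) into plain hashes and
# arrays, leaving other values untouched.
def to_plain(value)
  case value
  when Struct then value.to_h.transform_values { |v| to_plain(v) }
  when Array then value.map { |v| to_plain(v) }
  else value
  end
end

paper = Paper.new("1301.0001", "An Example Title",
                  [Version.new("v1", "2013-01-01")])

# Serialize the plain structure as a JSON string.
json = JSON.generate(to_plain(paper))
puts json
```

The recursion matters because `Struct#to_h` only converts the top level; nested Structs would otherwise be left as they are.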

Download and parse immediately

If you just want arxivsync to handle the request cycle and parsing while you handle storage yourself:

  ArxivSync.get_metadata(oai_params) do |resp, papers|
    papers.each do |paper|
      # Do stuff with paper
    end
  end
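The contents of `oai_params` are not spelled out above. One plausible reading is that it carries OAI-PMH ListRecords arguments, so the sketch below builds such a hash using the argument names from the OAI-PMH specification (`metadataPrefix`, `from`, `set`). This is an assumption: `build_oai_params` is a hypothetical helper, and the keys arxivsync actually expects may be spelled differently, so check the gem's source before relying on it.

```ruby
require "date"

# Hypothetical helper, not part of arxivsync. Builds a hash of OAI-PMH
# ListRecords arguments. Key names follow the OAI-PMH specification;
# whether arxivsync accepts exactly these keys is an assumption.
def build_oai_params(from: nil, set: nil)
  params = { metadataPrefix: "arXivRaw" }
  # Dates are formatted at day granularity (YYYY-MM-DD).
  params[:from] = from.strftime("%Y-%m-%d") if from
  params[:set] = set if set
  params
end

# Example values only: harvest one set from a given date onward.
oai_params = build_oai_params(from: Date.new(2013, 5, 1),
                              set: "physics:quant-ph")
```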

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request
