Traject::Marc4JReader is for JRuby only.
Traject::Marc4JReader is a reader for the traject ETL system
that allows the use of marc4j as a reader when dealing with MARC
binary or MARC-XML files. It is of no use outside of
traject run under JRuby.
It leverages marc-marc4j, which is a paper-thin wrapper around
the marc4j .jar that is shipped with it.
The output of the reader is a vanilla ruby-marc object. You can hang onto the
original marc4j Java object with the marc4j_reader.keep_marc4j setting (see below).
Why use this?
The biggest reason would be faster MARC/MARC-XML parsing and generation than the vanilla marc gem can provide, or a need to do something wacky with the marc4j internal structure (such as feeding it to legacy Java code you have lying around).
In general, the marc4j library will parse MARC21 (binary) and MARC-XML roughly twice as fast as the pure-ruby library. While MARC parsing tends not to be a huge part of the workload in a traject run, you'll almost certainly see performance gains.
Traject prior to 3.0 included this as a dependency on JRuby, and defaulted to using it.
In Traject 3.0+, you need to manually add this gem and configure traject to use it.
If you are using bundler and a Gemfile, add
gem "traject-marc4j_reader", "~> 1.0" to your
Gemfile. Otherwise, just
gem install traject-marc4j_reader.
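Because the reader is of no use outside JRuby, a project that runs under more than one Ruby implementation may want to restrict the gem to JRuby. The Gemfile below is a minimal sketch of that idea using Bundler's `platforms` option; the rubygems.org source line and the traject version constraint are assumptions for illustration, not something this README prescribes.

```ruby
# Gemfile (sketch; the source and the traject version constraint are assumptions)
source "https://rubygems.org"

gem "traject", "~> 3.0"

# Only install and load the marc4j reader when Bundler runs under JRuby.
gem "traject-marc4j_reader", "~> 1.0", platforms: :jruby
```

With this in place, `bundle install` under MRI skips the gem, so the traject config should not require it unconditionally.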
Then, in your traject config file:
    # Instead of require in config file, you could use the `-r` traject
    # command-line option.
    require 'traject/marc4j_reader'

    settings do
      provide "reader_class_name", "Traject::Marc4JReader"

      # Recommend marc4j_reader.permissive true unless you have reason not to.
      # true was default provided by core traject gem in Traject pre-3.0, but isn't
      # anymore in traject 3.0 -- so set to true explicitly to maintain behavior
      #
      # Only relevant for binary MARC source data.
      provide "marc4j_reader.permissive", true
    end
For more about the traject settings object, see the traject settings documentation.
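If the same config file is shared between JRuby and non-JRuby runs, one possible approach is to guard the setup with a check for the `JRUBY_VERSION` constant, which JRuby defines. This is a sketch of that pattern rather than anything required by the gem; outside JRuby the block is skipped and traject uses whatever reader is otherwise configured.

```ruby
# Traject config fragment (sketch): switch to Marc4JReader only under JRuby.
if defined?(JRUBY_VERSION)
  require 'traject/marc4j_reader'

  settings do
    provide "reader_class_name", "Traject::Marc4JReader"
    # Explicitly keep the pre-3.0 permissive behavior for binary MARC.
    provide "marc4j_reader.permissive", true
  end
end
```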
Note that the standard Marc4JReader always converts to UTF-8, so output will always reflect that conversion.
marc4j.jar_dir: Path to a directory containing the Marc4J jar file to use. All .jar files in the directory will be loaded. If unset, uses marc4j.jar bundled with traject.
marc4j_reader.permissive: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default false, but recommend true for most uses.
marc4j_reader.source_encoding: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC"
marc4j_reader.keep_marc4j: After translating the marc4j record into a normal ruby-marc object, provides access to the former via
marc4j_reader.class: Set to e.g. 'MarcStreamReader' to use that stricter Marc4J reader class, instead of the default MarcPermissiveStreamReader.
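As an illustration of how the settings above combine, here is a hedged sketch of a settings block for binary MARC input. The particular values are only examples chosen from the options listed above ("MARC" is one of the listed source encodings); adjust them to your data.

```ruby
# Traject config fragment (sketch): Marc4JReader settings for binary MARC input.
require 'traject/marc4j_reader'

settings do
  provide "reader_class_name", "Traject::Marc4JReader"

  # The permissive and source_encoding settings only apply to binary input.
  provide "marc.source_type", "binary"
  provide "marc4j_reader.permissive", true

  # Example value; the default is "BESTGUESS".
  provide "marc4j_reader.source_encoding", "MARC"

  # Keep the original marc4j record alongside the ruby-marc object.
  provide "marc4j_reader.keep_marc4j", true
end
```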
A simple example that reads in via marc4j and outputs to the newline-delimited-json writer.
Use would be:
    traject -c id_title.rb my_marc_file.mrc

    # File id_title.rb
    require 'traject'
    require 'traject/marc4j_reader'
    require 'traject/json_writer'
    require 'traject/macros/marc21_semantics'
    extend Traject::Macros::Marc21Semantics

    settings do
      provide "reader_class_name", "Traject::Marc4JReader"
      provide "marc4j_reader.keep_marc4j", true
      provide "writer_class_name", "Traject::JsonWriter"
      provide "output_file", "ids_and_titles.ndj"
    end

    to_field "id", extract_marc("001", :first => true)
    to_field "title", extract_marc_filing_version('245abdefghknp', :include_original => true)
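To check the result, the newline-delimited JSON file can be read back with plain Ruby. The sketch below assumes each line is one JSON object whose field values are arrays of strings; the sample records are invented for illustration and are not real output, and `parse_ndj` is a hypothetical helper, not part of traject.

```ruby
require 'json'

# Invented sample standing in for the contents of ids_and_titles.ndj.
SAMPLE_NDJ = <<~NDJ
  {"id":["000001"],"title":["Example title one"]}
  {"id":["000002"],"title":["Example title two","The example title two"]}
NDJ

# Hypothetical helper: parse newline-delimited JSON text into an array of hashes,
# skipping blank lines.
def parse_ndj(text)
  text.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

records = parse_ndj(SAMPLE_NDJ)

# Print the first id and first title of every record.
records.each do |rec|
  puts "#{rec.fetch('id').first}: #{rec.fetch('title', []).first}"
end
```

For a real run, pass `File.read("ids_and_titles.ndj")` to `parse_ndj` instead of the sample string.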
- Fork it ( https://github.com/[my-github-username]/traject_marc4j_reader/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request