$ bundle install
$ bundle exec rake
Can be run using MRI or JRuby.
SOLR_VERSION=6.6.5 NUM_THREADS=1 SOLR_URL=http://127.0.0.1:8983/solr/blacklight-core bundle exec traject -c lib/traject/config/sirsi_config.rb uni_00000000_00499999.marc
This codebase defines custom settings, used internally, in addition to the settings traject provides.
Setting | Description | Default |
---|---|---|
skip_empty_item_display | Can be provided via the environment variable SKIP_EMPTY_ITEM_DISPLAY, which tells the sirsi traject code whether to skip empty item_display fields. Any value greater than -1 will skip them. Tests are set to use -1 unless otherwise configured. | 0 |
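As a rough illustration of the rule in the table, the check could look like the sketch below. The helper name `skip_empty_item_display?` is hypothetical, and the integer parsing is an assumption; the actual configuration code may read the setting differently.

```ruby
# Hypothetical helper (not part of this codebase): decides whether empty
# item_display fields are skipped. Assumes the variable holds an integer
# string; the default of 0 is the one listed in the settings table.
def skip_empty_item_display?(env = ENV)
  Integer(env.fetch('SKIP_EMPTY_ITEM_DISPLAY', '0')) > -1
end

# Unset: falls back to 0, which is greater than -1, so skipping is on.
skip_empty_item_display?({})
# -1 (the value the tests use) turns skipping off.
skip_empty_item_display?({ 'SKIP_EMPTY_ITEM_DISPLAY' => '-1' })
```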
MARC binary data is dumped into files (each containing ~500k records) on the Symphony servers. These dumps happen hourly (containing every MARC record changed during that calendar day), nightly (containing every record changed the previous day), and monthly (containing every exportable MARC record in Symphony). Hourly and nightly dumps also include a del file, which lists, one catkey per line, the records that have been deleted or retracted.
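Because a del file is just one catkey per line, reading it takes very little code. The sketch below assumes plain text input and a hypothetical helper name, `catkeys_from_del`; ignoring blank lines is an assumption, not something the dumps are documented to need.

```ruby
# Hypothetical helper: turn the contents of a del file into an array of
# catkeys. Surrounding whitespace is stripped and blank lines are dropped.
def catkeys_from_del(text)
  text.each_line.map(&:strip).reject(&:empty?)
end
```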
Additional data comes from a course reserves data dump, also on the Symphony servers. Course reserves files are pipe-separated values (PSV) files, which are read in during the indexing process and used to enhance MARC records during the transform process.
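To make the PSV format concrete, here is a minimal sketch of parsing pipe-separated rows and grouping them by catkey so they can be looked up per record. The column names in `COLUMNS` are invented for illustration; the real course reserves layout is not described here.

```ruby
# Hypothetical column layout; the actual course reserves columns may differ.
COLUMNS = %i[catkey course_id instructor].freeze

# Split each non-empty line on "|" and pair the values with COLUMNS.
# The -1 limit keeps trailing empty fields instead of discarding them.
def parse_psv(text)
  text.each_line(chomp: true).reject(&:empty?).map do |line|
    COLUMNS.zip(line.split('|', -1)).to_h
  end
end

# Group rows by catkey so a record's reserves can be found in one lookup.
def reserves_by_catkey(text)
  parse_psv(text).group_by { |row| row[:catkey] }
end
```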
The indexing machines have scheduled cron tasks (see ./config/schedule.rb) that retrieve this data from the Symphony servers and process the data into a kafka topic. Messages in the topic are key-value pairs; the key is the catkey of the record, and the value is either blank (representing a delete) or contains one or more binary MARC records for the catkey. The topics are regularly compacted by Kafka to remove duplicate data.
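The message semantics described above can be modelled without a Kafka client. The sketch below only imitates the end result of compaction (the latest message per key survives) with a Hash; real compaction runs asynchronously inside Kafka, and `Message`, `compact`, and `action_for` are hypothetical names.

```ruby
# A stand-in for a topic message: key is the catkey, value is the payload.
Message = Struct.new(:key, :value)

# Imitate the effect of compaction: keep only the latest message per key.
def compact(messages)
  messages.each_with_object({}) { |msg, latest| latest[msg.key] = msg }.values
end

# A blank value represents a delete; any other value is MARC data to index.
def action_for(message)
  (message.value.nil? || message.value.empty?) ? :delete : :index
end

messages = [
  Message.new('123', 'marc-v1'),
  Message.new('456', 'marc-v1'),
  Message.new('123', '') # later delete for catkey 123
]
compact(messages).map { |msg| [msg.key, action_for(msg)] }
```

In this model the earlier MARC payload for catkey 123 is discarded, and only the delete remains for a consumer to act on.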
There are daemon processes managed by eye (see ./traject.eye locally, and ./config/settings.yml on the deployed servers) that continuously consume the kafka topics and run the traject indexing configuration on the data.
Traject can also index one or more MARC records directly. An example of a traject command used for indexing:
$ SOLR_URL=http://www.example.com/solr/collection-name NUM_THREADS=4 bundle exec traject -c lib/traject/config/sirsi_config.rb /path/uni_00000000_00499999.marc
The indexing machines also have scheduled cron tasks (again, see ./config/schedule.rb) for loading data from purl-fetcher into a kafka topic. This task records a state file (in ./tmp) that contains the timestamp of the most recent entry from purl-fetcher that was processed. Every minute, the cron task runs, retrieves the purl-fetcher changes since that most recent timestamp, and adds the messages to a kafka topic.
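The state-file pattern described above can be sketched as follows. Everything here is an assumption made for illustration: the method name `sync_changes`, the ISO 8601 timestamp format, the `:updated_at` field, and the `fetch_changes` and `publish` callables that stand in for the purl-fetcher request and the kafka producer.

```ruby
require 'time'

# Hypothetical sketch of one cron run:
# 1. read the last processed timestamp from the state file (epoch if absent),
# 2. fetch changes newer than that timestamp,
# 3. publish each change,
# 4. record the newest timestamp seen, so the next run resumes from there.
def sync_changes(state_path, fetch_changes:, publish:)
  since =
    if File.exist?(state_path)
      Time.iso8601(File.read(state_path).strip)
    else
      Time.at(0).utc
    end

  changes = fetch_changes.call(since)
  changes.each { |change| publish.call(change) }

  latest = changes.map { |change| change[:updated_at] }.max
  # Leave the state file untouched when nothing new arrived.
  File.write(state_path, latest.utc.iso8601) if latest
  changes.length
end
```

Writing the state file only after publishing means a failed run is retried from the old timestamp, at the cost of possibly publishing some changes twice.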
A daemon process managed by eye continuously consumes the kafka topic to run the traject indexing configuration on the data.