
SearchworksTrajectIndexer

Indexing MARC and MODS data for SearchWorks.

Installation

$ bundle install

Running the test suite

$ bundle exec rake

Running the Traject indexer

The indexer can be run with MRI or JRuby:

SOLR_VERSION=6.6.5 NUM_THREADS=1 SOLR_URL=http://127.0.0.1:8983/solr/blacklight-core bundle exec traject -c lib/traject/config/sirsi_config.rb uni_00000000_00499999.marc

Custom settings

This codebase defines custom settings, used internally, beyond what Traject provides.

skip_empty_item_display (default: 0)
Set via the SKIP_EMPTY_ITEM_DISPLAY environment variable, this tells the Sirsi traject code whether to skip empty item_display fields. Any value greater than -1 enables skipping. Tests use -1 unless otherwise configured.
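
For example, to enable skipping during a run (any value greater than -1 skips):

$ SKIP_EMPTY_ITEM_DISPLAY=1 bundle exec traject -c lib/traject/config/sirsi_config.rb /path/uni_00000000_00499999.marc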

Indexing Strategies

Sirsi

MARC binary data is dumped into files (each containing ~500k records) on the Symphony servers. These dumps happen hourly (containing every MARC record changed during that calendar day), nightly (containing every record changed the previous day), and monthly (containing every exportable MARC record in Symphony). Hourly and nightly dumps also include a del file, with one catkey per line for records that have been deleted or retracted.
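
A rough Ruby sketch of consuming a del file; the file name is hypothetical:

deleted_catkeys = File.readlines("records.del", chomp: true) # one catkey per line
deleted_catkeys.each do |catkey|
  # each catkey identifies a record to remove from the index
end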

Additional data comes from a course reserves data dump, also on the Symphony servers. Course reserves files are pipe-separated values (PSV) files, read in during indexing and used to enhance MARC records during the transform process.
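
A minimal sketch of reading such a file; the file name and column layout below are assumptions, not the actual course reserves schema:

File.foreach("reserves.psv") do |line|
  catkey, course_id, instructor = line.chomp.split("|") # hypothetical columns
  # use these values to enhance the MARC record for catkey during the transform
end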

The indexing machines have scheduled cron tasks (see ./config/schedule.rb) that retrieve this data from the Symphony servers and process it into a Kafka topic. Messages in the topic are key-value pairs: the key is the catkey of the record, and the value is either blank (representing a delete) or contains one or more binary MARC records for that catkey. The topics are regularly compacted by Kafka to remove duplicate data.
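
A sketch of producing such messages with the ruby-kafka gem; the broker address, topic name, and sample values are assumptions:

require "kafka"

kafka = Kafka.new(["localhost:9092"])    # hypothetical broker
catkey = "a12345"                        # hypothetical catkey
marc_bytes = File.binread("record.marc") # hypothetical binary MARC

kafka.deliver_message(marc_bytes, key: catkey, topic: "marc_bibs") # update: value carries the MARC
kafka.deliver_message(nil, key: catkey, topic: "marc_bibs")        # delete: blank value for the catkey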

There are daemon processes managed by eye (see ./traject.eye locally, and ./config/settings.yml on the deployed servers) that continuously consume the Kafka topics and run the traject indexing configuration on the data.
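
The real process definitions live in ./traject.eye; a minimal sketch of an eye configuration in that spirit, with illustrative names and paths:

Eye.application "traject" do
  process "sirsi_indexer" do
    start_command "bundle exec traject -c lib/traject/config/sirsi_config.rb"
    pid_file "tmp/pids/sirsi_indexer.pid" # hypothetical path
    daemonize true
    stdall "log/sirsi_indexer.log"        # hypothetical path
  end
end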

Traject can also index one or more MARC records directly. An example of a traject command used for indexing:

$ SOLR_URL=http://www.example.com/solr/collection-name NUM_THREADS=4 bundle exec traject -c lib/traject/config/sirsi_config.rb /path/uni_00000000_00499999.marc

SDR

The indexing machines also have scheduled cron tasks (again, see ./config/schedule.rb) for loading data from purl-fetcher into a Kafka topic. This task records a state file (in ./tmp) containing the timestamp of the most recent purl-fetcher entry that was processed. Every minute, the cron task runs, retrieves the purl-fetcher changes since that timestamp, and adds a message for each change to the Kafka topic.
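
A sketch of the state-file pattern described above, using only the Ruby standard library; the file name is hypothetical, and the real task lives in ./config/schedule.rb:

require "time"

state_file = "tmp/purl_fetcher_timestamp" # hypothetical name; the real state file lives under ./tmp
last_seen = File.exist?(state_file) ? Time.iso8601(File.read(state_file).strip) : Time.at(0)

# ...fetch purl-fetcher changes newer than last_seen and publish each to the Kafka topic...

File.write(state_file, Time.now.utc.iso8601) # record progress for the next run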

A daemon process managed by eye continuously consumes the Kafka topic to run the traject indexing configuration on the data.
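
A sketch of consuming that topic with the ruby-kafka gem, honoring the blank-value-means-delete convention; broker, consumer group, and topic names are assumptions:

require "kafka"

kafka = Kafka.new(["localhost:9092"])          # hypothetical broker
consumer = kafka.consumer(group_id: "traject") # hypothetical consumer group
consumer.subscribe("marc_bibs")                # hypothetical topic name

consumer.each_message do |message|
  if message.value.nil? || message.value.empty?
    # blank value: delete the record identified by message.key (the catkey)
  else
    # value carries one or more binary MARC records; run them through the traject config
  end
end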
