
Index Specification Language

rh9ec edited this page Nov 16, 2016 · 1 revision

Previous SolrMarc Index Specification

For simple index specifications, the prior version of SolrMarc would handle entries like the following:

id = 001, first
author_text = 100abcdeq4:110abcde4:111acdejnq4:LNK100abcdeq4:LNK110abcde4:LNK111acdejnq4
oclc_display = 035a, (pattern_map.oclc_num)

Each entry is a colon-separated (:) list of field tags, each with one or more subfield codes. The list can be followed by an optional translation map and then by an optional post-processing directive (first, join, or all).
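To make the shape of these legacy entries concrete, here is a minimal Python sketch that splits one line into its parts. It is not SolrMarc's parser: `parse_legacy_spec` and the dictionary layout are hypothetical names chosen for illustration, the LNK prefix is only recorded as a flag, and the sketch assumes a tag may carry no subfield codes so that `001` parses. It does not handle the `custom` entries described further down.

```python
import re

# Post-processing directives named in the text above.
POST_DIRECTIVES = {"first", "join", "all"}

# Optional LNK prefix, a three-digit tag, then zero or more subfield codes
# (zero is allowed here so that control fields such as 001 parse).
FIELD_SPEC = re.compile(r"(LNK)?(\d{3})([a-z0-9]*)")


def parse_legacy_spec(line):
    """Split 'name = fields[, map][, directive]' into parts (hypothetical helper)."""
    name, _, rhs = line.partition("=")
    parts = [part.strip() for part in rhs.split(",")]

    fields = []
    for spec in parts[0].split(":"):
        match = FIELD_SPEC.fullmatch(spec)
        if match is None:
            raise ValueError(f"unrecognized field spec: {spec!r}")
        fields.append({
            "linked": match.group(1) is not None,
            "tag": match.group(2),
            "subfields": match.group(3),
        })

    translation_map = None
    directive = None
    for extra in parts[1:]:
        if extra in POST_DIRECTIVES:
            directive = extra
        else:
            translation_map = extra  # kept verbatim, parentheses included

    return {
        "name": name.strip(),
        "fields": fields,
        "map": translation_map,
        "directive": directive,
    }
```

Calling `parse_legacy_spec("oclc_display = 035a, (pattern_map.oclc_num)")` yields one field entry for tag 035 with subfield code `a`, the map token as written, and no directive.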

It would extract the data from the specified fields and subfields, then apply the map (if present) and the post-processing directive. However, if you wanted to do anything more complex than that, you needed to either invoke a predefined custom method or define one of your own. Some examples are:

responsibility_statement_display = custom, removeTrailingPunct(245c)
title_facet = custom, getSortableTitle
journal_title_text = custom, getJournalTitleText(245a:LNK245a)

The new, richer index specification language

A major design goal of this new specification language is that it will accept everything that the previous version did, and produce essentially the same results given the same records. Another major goal is to be able to handle much more complex situations before a user needs to create a custom indexing method.

To that end, it supports adding one or more comma-separated options after the field spec, such as:

join    or  join(" : ") -- concatenate the matching subfields producing one output line per matching datafield
separate                -- do not concatenate matching subfields, instead output one line per matching subfield
format                  --
substring(start, end)   -- extract only a range of the data in the matching subfield
cleanEach               -- strip punctuation from each matching subfield
cleanEnd                -- strip punctuation after the subfields are concatenated
clean                   -- equal to specifying both cleanEnd and cleanEach
stripAccent             -- strip accent marks and diacritics from characters
stripPunct              -- strip all punctuation everywhere
stripInd2               -- strip a number of characters from the $a subfield equal to the digit in indicator2
toUpper                 -- convert all letters to uppercase
toLower                 -- convert all letters to lowercase
titleSortUpper          -- equal to cleanEach, cleanEnd, stripAccent, stripPunct, stripInd2, toUpper
titleSortLower          -- equal to cleanEach, cleanEnd, stripAccent, stripPunct, stripInd2, toLower
untrimmed               -- cause it to NOT call String.trim()
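As a rough illustration of how several of these options combine, the following Python sketch approximates what titleSortLower is described as doing: stripInd2, stripAccent, stripPunct, and toLower applied in sequence. The function names are hypothetical, the order of the steps and the exact punctuation set are assumptions, and the final whitespace collapse is an addition of this sketch; SolrMarc's own behavior may differ in those details.

```python
import string
import unicodedata


def strip_ind2(subfield_a, ind2):
    """Drop as many leading characters as the digit in indicator 2 says."""
    count = int(ind2) if ind2.isdigit() else 0
    return subfield_a[count:]


def strip_accent(text):
    """Decompose characters and drop the combining marks (accents, diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punct(text):
    """Remove ASCII punctuation (assumption: the real punctuation set may be wider)."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def title_sort_lower(subfield_a, ind2):
    """Approximate titleSortLower for a single $a value (hypothetical helper)."""
    text = strip_ind2(subfield_a, ind2)
    text = strip_accent(text)
    text = strip_punct(text)
    # Collapsing runs of whitespace is an extra step added by this sketch.
    return " ".join(text.lower().split())
```

For example, `title_sort_lower("The Café society :", "4")` skips the four-character article "The ", removes the accent and the trailing colon, and returns `"cafe society"`.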

Using these options, it is possible to achieve results similar to the first two custom methods above:

responsibility_statement_display = 245c, clean
title_facet = 245abk, titleSortLower, first

Additionally, it now supports conditional qualifiers that allow you to include certain fields/subfields only if certain conditions are true.

If you want to extract publication information from the 260abc subfields or from the RDA-style 264abc subfields, but want the 264 field only if its second indicator is "1" or "4", you can use the following conditional spec:

published_text = 260abc:264abc?(ind2 = '1' || ind2 = '4')
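The selection logic of the spec above can be sketched in Python as follows. The record layout (a list of dictionaries with `tag`, `ind2`, and `subfields` keys) is a hypothetical stand-in for a parsed MARC record, and joining the subfields with a single space is an assumption about the default output.

```python
def published_text(fields):
    """Collect $a$b$c from every 260, and from a 264 only when ind2 is '1' or '4'."""
    results = []
    for field in fields:
        is_260 = field["tag"] == "260"
        is_wanted_264 = field["tag"] == "264" and field["ind2"] in ("1", "4")
        if not (is_260 or is_wanted_264):
            continue
        values = [value for code, value in field["subfields"] if code in "abc"]
        if values:
            # Assumption: matching subfields are joined with one space.
            results.append(" ".join(values))
    return results
```

Note that the condition is attached only to the 264 part of the spec, so 260 fields pass through regardless of their indicators.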

Or, if you want to extract journal titles from the 245 field, but only when the record identifies itself as a journal by an "s" in character 7 of the record leader, you can use:

journal_title_text = {245a:LNK245a} ? (000[7] = 's' )

The curly braces cause the conditional clause to be applied to all fields contained within.
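A small Python sketch of this record-level condition is shown below. It assumes that `000[7]` is a zero-based offset into the leader and uses a hypothetical record dictionary with `leader` and `fields` keys; the linked-field part of the spec (LNK245a) is left out to keep the example short.

```python
def journal_title_text(record):
    """Return every 245 $a, but only when leader position 7 is 's'."""
    leader = record["leader"]
    # The condition is checked once and guards every field in the braces.
    if len(leader) <= 7 or leader[7] != "s":
        return []
    titles = []
    for field in record["fields"]:
        if field["tag"] != "245":
            continue
        for code, value in field["subfields"]:
            if code == "a":
                titles.append(value)
    return titles
```

Because the test looks at the leader rather than at the individual field, it either admits or rejects the whole braced group for a given record.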

If you want subject headings only when they conform to certain heading schemes (as indicated by the second indicator and the $2 subfield), you can use:

subject_text = {600[a-z]:610[a-z]:611[a-z]}?(ind2 != 7||(ind2 = 7 && $2 matches "fast|lcsh|tgn|aat"))
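The compound condition in this spec can be read as the following Python sketch. It assumes that `matches` means the whole $2 value must match the pattern, that indicators are compared as single characters, and that matching subfields are joined with a space; `keep_subject` and `subject_text` are hypothetical names, and the field layout is the same illustrative dictionary form as in the earlier sketches.

```python
import re

SUBJECT_TAGS = {"600", "610", "611"}
# Assumption: "matches" requires the entire $2 value to match.
SCHEMES = re.compile(r"fast|lcsh|tgn|aat")


def keep_subject(field):
    """Mirror: ind2 != 7 || (ind2 = 7 && $2 matches "fast|lcsh|tgn|aat")."""
    if field["tag"] not in SUBJECT_TAGS:
        return False
    if field["ind2"] != "7":
        return True
    return any(
        code == "2" and SCHEMES.fullmatch(value) is not None
        for code, value in field["subfields"]
    )


def subject_text(fields):
    """Join the a-z subfields of every subject field that passes the condition."""
    results = []
    for field in fields:
        if keep_subject(field):
            values = [v for code, v in field["subfields"] if "a" <= code <= "z"]
            results.append(" ".join(values))
    return results
```

The $2 subfield itself is never emitted, because the `[a-z]` range in the spec covers only the lettered subfields.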