
Basic field based extraction specifications

haschart edited this page Nov 20, 2016 · 1 revision

The syntax for specifying which fields or subfields (or which portion of a field or subfield) are used to create the Solr index field(s) consists of one or more field specifications separated by colons (:).

A field specification consists of a three-digit string (000 – 999) optionally followed by characters indicating which subfields and/or bytes to use.

  • no subfields specified, e.g. 100 - all subfields of the specified Marc field, in order of occurrence in the Marc record, will be concatenated into a single value. Each occurrence of the Marc field will create a separate instance of the Solr field in the Solr document.

  • single letter after the field, e.g. 041a - for each occurrence of the Marc field, each occurrence of the subfield will create a Solr field instance of the contents of the subfield.

  • same single letter repeated after the field, e.g. 650aa - for each occurrence of the Marc field, a single space-separated concatenation of all occurrences of the subfield will be a Solr field instance in the Solr doc.

  • multiple letters after the field, e.g. 100abcdq - for each occurrence of the field, all the indicated subfields, in order of occurrence in the Marc record, will be concatenated into a single value. Each occurrence of the Marc field will create a separate instance of the Solr field in the Solr document.

  • square brackets containing a digit pattern for a fixed length field (i.e. leader, 001-009), e.g. 008[35-37] or 000[5] - the digits in the brackets indicate the characters to be used as a value. The counting is 0-based: the first byte in the fixed field is 0. 008[35-37] will return the three character sequence at bytes 35, 36, 37 in the 008 field. Each instance of the Marc field (e.g. for an 007, which is repeatable) will create a separate instance of the Solr field in the Solr document.

  • square brackets indicating a regular expression describing subfields for a variable length field, e.g. 110[a-z] or 243[a-gk-s] - for each occurrence of the field, all the indicated subfields, in order of occurrence in the Marc record, will be concatenated into a single value. Each occurrence of the Marc field will create a separate instance of the Solr field in the Solr document.
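The forms listed above can be told apart mechanically. The following Python sketch is only an illustration of the rules as this page states them, not SolrMarc's own parser; the names FieldSpec, parse_field_spec and parse_field_specs, and the kind labels "all", "each", "joined" and "bytes", are hypothetical and chosen for this example.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class FieldSpec(NamedTuple):
    tag: str                               # three-digit MARC tag, "000" = leader
    kind: str                              # "all", "each", "joined" or "bytes"
    subfields: str                         # character class of subfield codes, "" if unused
    byte_range: Optional[Tuple[int, int]]  # inclusive, 0-based, fixed fields only


def parse_field_spec(spec: str) -> FieldSpec:
    """Classify one field specification such as 100, 041a, 650aa or 008[35-37]."""
    match = re.fullmatch(r"(\d{3})(.*)", spec.strip())
    if match is None:
        raise ValueError(f"not a field specification: {spec!r}")
    tag, rest = match.groups()

    # No suffix: all subfields, concatenated per field occurrence.
    if rest == "":
        return FieldSpec(tag, "all", "", None)

    bracket = re.fullmatch(r"\[(.+)\]", rest)
    # Leader (000) and 001-009 are fixed fields: brackets hold byte offsets.
    if bracket and tag < "010":
        digits = re.fullmatch(r"(\d+)(?:-(\d+))?", bracket.group(1))
        if digits is None:
            raise ValueError(f"bad byte pattern in {spec!r}")
        start = int(digits.group(1))
        end = int(digits.group(2)) if digits.group(2) else start
        return FieldSpec(tag, "bytes", "", (start, end))
    # Variable fields: brackets hold a character class of subfield codes.
    if bracket:
        return FieldSpec(tag, "joined", bracket.group(1), None)

    if re.fullmatch(r"[a-z0-9]+", rest) is None:
        raise ValueError(f"bad subfield list in {spec!r}")
    # A single letter yields one value per subfield occurrence.
    if len(rest) == 1:
        return FieldSpec(tag, "each", rest, None)
    # Repeated or multiple letters are joined per field occurrence.
    return FieldSpec(tag, "joined", "".join(dict.fromkeys(rest)), None)


def parse_field_specs(specs: str) -> List[FieldSpec]:
    """Split a colon-separated list and classify each part."""
    return [parse_field_spec(part) for part in specs.split(":")]
```

For instance, parse_field_spec("650aa") is classified as "joined" over subfield a, while parse_field_spec("041a") is classified as "each".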

There are additional ways to generate Solr fields from your MARC data explained below.

Example Field Specifications

full_title_display = 245

For each MARC 245 field, concatenate all subfield values, separated by a space, then add a field named full_title_display to the Solr document with the concatenated value. Note that there is a single Solr field occurrence for each specified MARC field.

brief_title_display = 245a

For each subfield a in each of the 245 fields in the MARC record, add a field to the Solr document named brief_title_display, with the value from the MARC 245 subfield a. Note that there is a single Solr field occurrence for each specified MARC subfield in the MARC field. Aside: since the MARC specification states that there can only be a single 245 field and only a single subfield a, the results of this specification will be identical to what they would be if the field specification was brief_title_display = 245a, first.

author_text = 100a:110a:111a:130a

For each subfield a in each of the 100, 110, 111 and 130 fields in the MARC record, add a field to the Solr document named author_text with the value from that subfield. Note that each of these values is added as a separate Solr field in the Solr document.

fruit_text = 999a

Given a record that contained two 999 fields as follows:

999 $a apricot
999 $a apple $b banana $a aardvark

this specification would produce three values: fruit_text:apricot, fruit_text:apple and fruit_text:aardvark

all_fruit_text = 999ab

Given a record that contained two 999 fields as follows:

999 $a apricot
999 $a apple $b banana $a aardvark

this specification would produce two values: all_fruit_text:apricot and all_fruit_text:apple banana aardvark
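The difference between the two fruit examples can be reproduced with a small Python sketch. It models a record as a list of (tag, subfields) pairs and follows the rules stated on this page; the RECORD structure and the extract function are hypothetical stand-ins, not part of SolrMarc.

```python
from typing import List, Tuple

Subfield = Tuple[str, str]            # (code, value)
Field = Tuple[str, List[Subfield]]    # (tag, subfields in record order)

# The two 999 fields from the examples above.
RECORD: List[Field] = [
    ("999", [("a", "apricot")]),
    ("999", [("a", "apple"), ("b", "banana"), ("a", "aardvark")]),
]


def extract(record: List[Field], tag: str, codes: str) -> List[str]:
    """Return the Solr values produced by a spec like 999a, 999ab or 999aa."""
    values: List[str] = []
    for field_tag, subfields in record:
        if field_tag != tag:
            continue
        wanted = [value for code, value in subfields if code in codes]
        if len(codes) == 1:
            # Single letter: one value per subfield occurrence.
            values.extend(wanted)
        elif wanted:
            # Several (or repeated) letters: one joined value per field.
            values.append(" ".join(wanted))
    return values


print(extract(RECORD, "999", "a"))   # ['apricot', 'apple', 'aardvark']
print(extract(RECORD, "999", "ab"))  # ['apricot', 'apple banana aardvark']
print(extract(RECORD, "999", "aa"))  # ['apricot', 'apple aardvark']
```

The third call shows the repeated-letter form: the same subfield is selected, but its occurrences are joined per field instead of being emitted one by one.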

author_addl_t = 700abcegqu:710abcdegnu:711acdegjnqu

For each occurrence of a specified MARC field (700, 710 and 711), concatenate the values of the specified subfields, with a space separator, then add a field named author_addl_t to the Solr document with the concatenated value. Note that there is a single Solr field occurrence for each specified MARC field.

material_type_display = 300aa

For each MARC 300 field, concatenate all subfield a values, separated by a space, then add a field named material_type_display to the Solr document with the concatenated value. Note that there is a single Solr field occurrence for each MARC field, not each MARC subfield.

title_added_entry_t = 700[gk-pr-t]:710[fgk-t]:711fgklnpst:730[a-gk-t]:740anp

The bracketed subfield lists here are regular expression character classes. This field spec is equivalent to title_added_entry_t = 700gklmnoprst:710fgklmnopqrst:711fgklnpst:730abcdefgklmnopqrst:740anp
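The equivalence can be checked by expanding each bracketed class into the explicit subfield codes it matches. The Python sketch below does this with the standard re module; expand_subfield_class and expand_spec are hypothetical helper names, and the sketch applies only to variable length fields, since brackets on fixed fields hold byte offsets instead.

```python
import re
import string

# Candidate subfield codes: lowercase letters and digits.
CODES = string.ascii_lowercase + string.digits


def expand_subfield_class(char_class: str) -> str:
    """Turn a class body such as 'gk-pr-t' into the codes it matches."""
    pattern = re.compile(f"[{char_class}]")
    return "".join(code for code in CODES if pattern.fullmatch(code))


def expand_spec(spec: str) -> str:
    """Replace every [..] group in a colon-separated spec with explicit codes."""
    return re.sub(r"\[([^\]]+)\]", lambda m: expand_subfield_class(m.group(1)), spec)


print(expand_spec("700[gk-pr-t]:710[fgk-t]:711fgklnpst:730[a-gk-t]:740anp"))
# 700gklmnoprst:710fgklmnopqrst:711fgklnpst:730abcdefgklmnopqrst:740anp
```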

language_facet = 008[35-37]:041a:041d, language_map.properties

This field specification states that characters 35 through 37 should be selected from each 008 control field of the MARC record (which is where a three-letter encoding of the primary language of a bibliographic work is found). Additionally, each occurrence of the a and d subfields of all 041 fields in the MARC record becomes an individual Solr field named language_facet in the Solr document.

Note that a second parameter is present on the field specification entry: language_map.properties. If this optional parameter is present, once the set of strings is created for all of the fields and subfields specified in the first parameter, the entire set is translated using the translation map that is defined in the separate property file named language_map.properties, which maps the three-letter abbreviations for languages to the full name of that language. Hence "eng" becomes "English," "fre" becomes "French," "chp" becomes "Chipewyan," and "peo" becomes the ever-popular "Old Persian (ca. 600-400 B.C.)." The details of how to define a translation map are covered in the next section.
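As a rough illustration of the two steps (extract, then translate), the Python sketch below uses a plain dict in place of language_map.properties, holding only the four entries quoted above. The sample 008 value and the 041 values are made up for the example. Two behaviours are assumptions of this sketch and not statements about SolrMarc: values with no map entry are dropped, and duplicate translations are collapsed because the text speaks of a set.

```python
from typing import Dict, List

# Stand-in for language_map.properties (only the entries quoted above).
LANGUAGE_MAP: Dict[str, str] = {
    "eng": "English",
    "fre": "French",
    "chp": "Chipewyan",
    "peo": "Old Persian (ca. 600-400 B.C.)",
}


def fixed_bytes(value: str, start: int, end: int) -> str:
    """Return bytes start..end of a fixed field, 0-based and inclusive."""
    return value[start:end + 1]


def translate(values: List[str], mapping: Dict[str, str]) -> List[str]:
    """Map each extracted value, dropping unmapped ones and duplicates (assumed)."""
    translated: List[str] = []
    for value in values:
        target = mapping.get(value)
        if target is not None and target not in translated:
            translated.append(target)
    return translated


# Hypothetical record data: a 40-character 008 with "eng" at bytes 35-37,
# plus 041 $a "eng" and 041 $d "fre".
field_008 = " " * 35 + "eng" + "  "
extracted = [fixed_bytes(field_008, 35, 37), "eng", "fre"]

print(translate(extracted, LANGUAGE_MAP))  # ['English', 'French']
```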

broad_format_facet = 000[6]:007[0], format_maps.properties(broad_format), first

This field specification states that the value of character 6 (counting from 0) of field 000 (which stands for the leader of the MARC record) and character 0 of field 007 are to be extracted. Both of these values are then translated using the translation map that is defined in the separate property file named format_maps.properties, by loading all the entries there that start with the string broad_format. The first translated value is used as the value for the Solr index entry.

Or to put it more succinctly: look up character 6 of the 000 field in the map broad_format. If the map contains a mapping for that character, use that value. Otherwise, look up character 0 of the 007 field in the map broad_format, and if the map contains a mapping for that character, use that value. If neither extracted value matches an entry in the translation map, check whether the map defines a default value. If so, use that default value; otherwise leave the broad_format_facet index entry unassigned.
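The lookup order described here amounts to "first candidate that the map knows, else the default, else nothing". A minimal Python sketch of that logic follows; first_translated is a hypothetical name, the map entries are invented for illustration, and passing the default as a separate argument is an assumption, since how a map declares its default is covered elsewhere.

```python
from typing import Dict, List, Optional

# Invented entries standing in for the broad_format map.
BROAD_FORMAT: Dict[str, str] = {"a": "Book", "v": "Video"}


def first_translated(
    candidates: List[str],
    mapping: Dict[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the translation of the first candidate found in the map.

    candidates are the extracted values in spec order, e.g. leader byte 6
    followed by 007 byte 0. None means the Solr field is left unassigned.
    """
    for candidate in candidates:
        if candidate in mapping:
            return mapping[candidate]
    return default


print(first_translated(["a", "v"], BROAD_FORMAT))             # Book
print(first_translated(["z", "v"], BROAD_FORMAT))             # Video
print(first_translated(["z", "q"], BROAD_FORMAT))             # None
print(first_translated(["z", "q"], BROAD_FORMAT, "Unknown"))  # Unknown
```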