Translation maps

Demian Katz edited this page Sep 10, 2019 · 2 revisions

Mapping "Raw" Values to New Values

Creating a translation map that converts the cryptic codes found in a MARC record into human-readable strings, which makes searching and faceting more useful to end users, is fairly straightforward. First, add a second parameter to the field specification entry in the properties file, as shown in the last two of the examples above. This parameter specifies either the name of a separate properties file that contains the map, or the name of that file plus the property key prefix to look for in it. In the last example above, a map named broad_format is referenced in the properties file format_maps.properties. The entries in that file that start with the string broad_format define the map. The definition of that map is:

broad_format.v = Video
broad_format.a = Book
broad_format.t = Book
broad_format.m = Computer File
broad_format.c = Musical Score
broad_format.d = Musical Score
broad_format.j = Musical Recording
broad_format.i = Non-musical Recording
broad_format = Unknown

Note that Unknown is the default value for the translation map.
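The lookup behavior described here can be illustrated with a minimal Python sketch. This is not SolrMarc's own (Java) implementation; the function names load_prefixed_map and translate are hypothetical, and the parsing handles only simple key = value lines rather than the full properties-file syntax.

```python
def load_prefixed_map(lines, prefix):
    """Collect 'prefix.key = value' entries from properties-style lines.

    A bare 'prefix = value' line is treated as the default value.
    (Hypothetical helper; only simple 'key = value' lines are handled.)
    """
    mapping, default = {}, None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == prefix:
            default = value
        elif key.startswith(prefix + "."):
            mapping[key[len(prefix) + 1:]] = value
    return mapping, default


def translate(raw, mapping, default=None):
    """Return the mapped value, or the default when no key matches."""
    return mapping.get(raw, default)


PROPERTIES = """\
broad_format.v = Video
broad_format.a = Book
broad_format.t = Book
broad_format.m = Computer File
broad_format = Unknown
""".splitlines()

broad_format, broad_default = load_prefixed_map(PROPERTIES, "broad_format")
```

Under these assumptions, translate("t", broad_format, broad_default) yields Book, while an unmapped code such as "z" falls back to Unknown.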

Each line defining a translation starts with the name of the map, followed by a period, followed by the string to be replaced. Next comes an equals sign, followed by the replacement string. Several different strings can map to the same result (as shown for Book and Musical Score), but the same string cannot map to two different results. This example looks at position 6 of the MARC leader and position 0 of field 007. Suppose you wanted to map the character r in position 6 of the leader to Three-dimensional artifact, and also map the character r in position 0 of field 007 to Remote-sensing image. You could not accomplish this with a field specification and a translation map; instead, you would have to create a custom indexing function.

Note also that if no mapping is present for a given input, no value is entered for that index entry in the Solr index record. This can be exploited in conjunction with the first field specification command, as shown in the following example:

music_catagory_facet = 999a[0-1]:999a[0], music_maps.properties(music_catagory), first

music_catagory.ML = Music Literature
music_catagory.MT = Music Theory
music_catagory.M2 = Monuments of Music
music_catagory.M3 = Composers' Collected Works
music_catagory.M = Printed Music

In this example, the first two characters of the 999a subfield are extracted. If these two characters are ML, MT, M2 or M3, the translation map returns the corresponding value. If they do not match one of those four strings, the translation map returns null and the next step in the specification is processed: it takes only the first character of the 999a subfield and passes that to the translation map, which checks it against the single letter M using the fifth map entry. If the value matches, Printed Music is used for the Solr index field entry; otherwise no value is used for the music_catagory_facet field of the Solr index record.
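The fallback described above can be sketched in Python as follows. This is only an illustration of the two-step lookup, not SolrMarc code; the function name music_category and the sample subfield values are assumptions.

```python
MUSIC_CATEGORY = {
    "ML": "Music Literature",
    "MT": "Music Theory",
    "M2": "Monuments of Music",
    "M3": "Composers' Collected Works",
    "M": "Printed Music",
}


def music_category(subfield_999a):
    """Mimic '999a[0-1]:999a[0] ... first' for a single 999a value.

    Tries the first two characters, then the first character alone,
    and returns the first translation found (or None if neither maps).
    """
    for candidate in (subfield_999a[0:2], subfield_999a[0:1]):
        value = MUSIC_CATEGORY.get(candidate)
        if value is not None:
            return value
    return None
```

With hypothetical inputs, "ML410 .B4" resolves on the first step, "M1001" misses on "M1" and falls through to the single letter M, and "QA76" produces no value at all.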

Lastly, note that duplicate entries are winnowed out both before the translation map is applied and again while collecting the results of applying it. So if the following map were applied:

recording_format.MUSIC-CD = CD
recording_format.RSRV-CD = CD
recording_format.AUDIO-CD = CD
      
recording_format.AUDIO-CASS = Cassette
recording_format.MUSIC-CASS = Cassette
recording_format.RSRV-CASS = Cassette
recording_format.RSRV-AUD = Cassette
recording_format.RSRV-CAS2D = Cassette
      
recording_format.DVD = DVD
recording_format.HS-VDVD = DVD
recording_format.HS-VDVD3 = DVD
recording_format.RSRV-VDVD = DVD
      
recording_format.LP = LP
recording_format.IVY-LP = LP
recording_format.MUSIC-LP = LP
recording_format.OPENREEL = Open Reel Tape
      
recording_format.VIDEO-CASS = VHS
recording_format.RSRV-VCASS = VHS

recording_format.VIDEO-DISC = Video Disc
recording_format.RSRV-VDISC = Video Disc

and the set of strings gathered for the item consisted of {AUDIO-CASS, MUSIC-CASS} the final returned result would be {Cassette}.
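A short Python sketch of this two-stage de-duplication, under the assumption that order of first appearance is preserved (the source does not say how results are ordered). The function name translate_unique is hypothetical, and only a few entries of the map are reproduced.

```python
RECORDING_FORMAT = {
    "MUSIC-CD": "CD",
    "AUDIO-CD": "CD",
    "AUDIO-CASS": "Cassette",
    "MUSIC-CASS": "Cassette",
    "DVD": "DVD",
}


def translate_unique(raw_values, mapping):
    """Drop duplicate raw values, translate, then drop duplicate results."""
    unique_raw = dict.fromkeys(raw_values)  # first de-duplication pass
    results = {}
    for raw in unique_raw:
        mapped = mapping.get(raw)
        if mapped is not None:  # unmapped values contribute nothing
            results[mapped] = None  # second de-duplication pass
    return list(results)


cassette_only = translate_unique(["AUDIO-CASS", "MUSIC-CASS"], RECORDING_FORMAT)
```

Here cassette_only contains the single value Cassette, matching the result described above.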

In the case where you want to define only a single translation map in a properties file, which might be the case for large translation maps, you can specify only the name of the properties file on the index field specification line as shown below:

instrument_facet = 048a[0-1], instrument_map.properties

In this case all of the entries in that file are used to define the translation map, and there is no need to prefix the property keys with a common string. The instrument map would be defined as follows:

ba = Horn
bb = Trumpet
bc = Cornet
bd = Trombone
be = Tuba
bf = Baritone horn
bn = Brass, Unspecified
bu = Brass, Unknown
by = Brass, Ethnic
bz = Brass, Other
ca = Choruses, Mixed
cb = Chorus, Women's
cc = Choruses, Men's
cd = Choruses, Children's
cn = Choruses, Unspecified
cu = Chorus, Unknown
cy = Choruses, Ethnic
ea = Synthesizer
eb = Electronic Tape
ec = Computer
ed = Ondes Martinot
en = Electronic, Unspecified
eu = Electronic, Unknown

This makes translation map properties files much easier to create, maintain, and understand.

Defining a Pattern-Based Translation Map

The previous section described how to define a translation map for a field. One limitation of that approach is that it can only map from a fixed, pre-specified set of values. If the value in the field doesn't exactly match one of the translation keys, it is not mapped to anything and is usually discarded.

Sometimes you may want to look for a pattern of characters somewhere in the input field, and if that pattern occurs, then output some value to the index field. To specify this in the field specification entry, specify the name of the translation map as described above:

ports_facet = 650c:650z:651a:651x:651z:655z, semester_at_sea.properties(port)

Then define the translation map like this:

port.pattern_0 = Nassau.*Bahamas=>Nassau
port.pattern_1 = Salvador.*Brazil=>Salvador
port.pattern_2 = Walvis Bay.*Namibia=>Walvis Bay
port.pattern_3 = Cape Town.*South Africa=>Cape Town
port.pattern_4 = Chennai.*India=>Chennai
port.pattern_5 = Penang.*Malaysia=>Penang
port.pattern_6 = Ho Chi Minh City.*Vietnam=>Ho Chi Minh City
port.pattern_7 = Hong Kong=>Hong Kong
port.pattern_8 = Shanghai.*China=>Shanghai
port.pattern_9 = Kobe.*Japan=>Kobe
port.pattern_10 = Yokohama.*Japan=>Yokohama
port.pattern_11 = Puntarenas.*Costa Rica=>Puntarenas
port.pattern_12 = Bombay.*India=>Chennai
port.pattern_13 = Namibia=>Namibia
port.pattern_14 = South Africa=>South Africa
port.pattern_15 = India=>India
port.pattern_16 = Malaysia=>Malaysia
port.pattern_17 = Vietnam=>Vietnam
port.pattern_18 = China=>China
port.pattern_19 = Japan=>Japan
port.pattern_20 = Costa Rica=>Costa Rica
port.pattern_21 = Bahamas=>Bahamas
port.pattern_22 = Brazil=>Brazil

Every field extracted from a given MARC record is then matched against all of the patterns specified in the map. Note that these entries must start with (map_identifier).pattern_0 and proceed sequentially from there. When using multiple translation maps, each map identifier ("port" in the example above) must be unique. The value of each pattern entry is split at the =>; the portion before the arrow is used as a regular expression, and if that regular expression matches anywhere inside any of the fields extracted from the MARC record, the string after the arrow is added to the index record.

In this example, if a single field extracted from the MARC record contained Chennai, followed eventually by India, the value Chennai would be added to the index. If that same field also contained Penang followed by Malaysia, the value Penang would also be added. Notice that for the last entries in the map above, the pattern is a simple string. So, based on pattern_19 above, if one of the fields extracted from the MARC record contains the word Japan, then Japan is added to the index.
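The matching rule can be sketched in Python as shown here. SolrMarc itself uses Java regular expressions, so this is only an approximation for simple patterns; the function names and the sample field value are assumptions, and only a subset of the port map is reproduced.

```python
import re

PORT_PATTERNS = [
    "Nassau.*Bahamas=>Nassau",
    "Chennai.*India=>Chennai",
    "Penang.*Malaysia=>Penang",
    "India=>India",
    "Malaysia=>Malaysia",
    "Japan=>Japan",
]


def compile_patterns(entries):
    """Split each entry at '=>' into (compiled regex, output string)."""
    compiled = []
    for entry in entries:
        pattern, _, output = entry.partition("=>")
        compiled.append((re.compile(pattern), output))
    return compiled


def apply_pattern_map(field_values, compiled):
    """Add the output string of every pattern found anywhere in a field."""
    results = []
    for value in field_values:
        for regex, output in compiled:
            if regex.search(value) and output not in results:
                results.append(output)
    return results


ports = apply_pattern_map(
    ["Chennai (India) and Penang (Malaysia)"],  # hypothetical field value
    compile_patterns(PORT_PATTERNS),
)
```

Note that, following the rule as stated, the country-only patterns match this field too, so ports contains Chennai, Penang, India and Malaysia.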

Another way to use the pattern-based translation map feature is to trim out a portion of the original string, using the regular expression grouping characters ( and ) and the $1 syntax in the replacement string. For example, suppose your records have several 035 fields, some of which contain OCLC numbers, indicated by a prefix of (OCLC) before the number, as in the following example record:

LEADER 00873pam a2200277 a 4500
001 u17922
008 831011s1984    njua          00110 eng
010   $a   83022049
020   $a0135959195 (pbk.)
035   $a(Sirsi) l83022049
035   $a(OCLC)10072685
039 0 $a2$b3$c3$d3$e3
040   $aDLC$cDLC$dVA@
049   $aVA@&
050 0 $aZ52.4$b.G34 1984
082 0 $a652$219
090   $aZ52.4$b.G34 1984$mVA@&$qGRAD BUS.
100 1 $aGalloway, Dianne.
245 10$aLearning to talk word processing /$cDianne Galloway.
260   $aEnglewood Cliffs, N.J. :$bPrentice-Hall,$cc1984.
300   $aviii, 119 p. :$bill. ;$c23 cm.
490 0 $aThe Modern office series
500   $aIncludes index.
596   $a13
650  0$aWord processing.

For this example, if you wanted to select only the number portion of the 035 fields that contain OCLC numbers, you could use the following index specification:

oclc_text = 035a, (pattern_map.oclc_num)

and then use the following pattern-based translation map:

pattern_map.oclc_num.pattern_0 = \\(OCLC\\)(.*)=>$1

which will discard the first 035 field from above, and then map the second field to the value 10072685.

Similarly, if you want to trim off everything following the initial letters of an LC call number, you could use the following pattern map:

pattern_map.call_num.pattern_0 = ([A-Za-z]*).*=>$1
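To see what these two capture-group patterns extract, here is a small Python sketch. It assumes the doubled backslashes in the properties file reduce to single backslashes once the file is loaded, and it converts the Java-style $1 into Python's group syntax; the function name apply_group_pattern is hypothetical.

```python
import re


def apply_group_pattern(value, pattern, replacement):
    """Return the replacement with $1 filled in, or None if no match.

    The Java-style '$1' is rewritten to Python's '\\g<1>' before expansion.
    """
    match = re.search(pattern, value)
    if match is None:
        return None  # unmatched fields are discarded
    return match.expand(replacement.replace("$1", r"\g<1>"))


OCLC_PATTERN = r"\(OCLC\)(.*)"      # as it reads after properties unescaping
CALL_NUM_PATTERN = r"([A-Za-z]*).*"

oclc_values = [
    apply_group_pattern(v, OCLC_PATTERN, "$1")
    for v in ["(Sirsi) l83022049", "(OCLC)10072685"]
]
call_letters = apply_group_pattern("Z52.4 .G34 1984", CALL_NUM_PATTERN, "$1")
```

With the sample record above, the Sirsi field yields no value, the OCLC field yields 10072685, and the call number Z52.4 .G34 1984 is trimmed to Z.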

Pattern Map Modifiers

There are three special values available which can be used as the last entry in a pattern map to modify the mapping behavior:

  • filter = filter the output based on the patterns in the map
  • keepRaw = if an input value is not matched by any of the patterns in the map, retain its original raw value in the output.
  • matchAll = apply each pattern in succession to the input string (i.e., apply pattern_0 to the input string, then apply pattern_1 to the output of pattern_0, and so on)
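The following Python sketch illustrates one reading of the keepRaw and matchAll descriptions above. It is an interpretation of those one-line descriptions, not SolrMarc's actual behavior; the function names and the cleanup patterns are hypothetical, and filter is not covered because its description does not specify enough detail to sketch.

```python
import re


def first_match(value, patterns):
    """Return the output of the first matching pattern, or None."""
    for pattern, output in patterns:
        if re.search(pattern, value):
            return output
    return None


def with_keep_raw(value, patterns):
    """keepRaw (as read here): fall back to the raw value when nothing matches."""
    mapped = first_match(value, patterns)
    return value if mapped is None else mapped


def with_match_all(value, patterns):
    """matchAll (as read here): feed each pattern's output into the next."""
    for pattern, replacement in patterns:
        value = re.sub(pattern, replacement, value)
    return value


COUNTRY_PATTERNS = [("Japan", "Japan")]
CLEANUP_PATTERNS = [(r"^\s+", ""), (r"[.,;:/]\s*$", "")]  # hypothetical map

kept = with_keep_raw("Lisbon, Portugal", COUNTRY_PATTERNS)
cleaned = with_match_all("  Word processing.", CLEANUP_PATTERNS)
```

Under this reading, kept retains the unmatched raw value Lisbon, Portugal, and cleaned has its leading spaces stripped by the first pattern and its trailing period stripped by the second.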