# Unmapped MARC fields and subfields
This notebook:
* Documents MARC fields and subfields omitted from Argot mappings
* Identifies MARC mapping decisions still to be made, by analyzing:
  * The current MARC specification from http://www.loc.gov/marc/bibliographic/ecbdlist.html
  * Argot mappings from https://github.com/trln/data-documentation/blob/master/argot/argot.xlsx
  * Our explicit field and subfield omission decisions

MARC fields we have decided to exclude from Argot mappings are recorded in [`unmapped_marc_tags.json`](https://github.com/trln/data-documentation/blob/master/marc/unmapped_marc_tags.json).

Specific MARC subfields excluded from Argot mappings *when some subfields from the field ARE mapped* are recorded in [`unmapped_marc_subfields.json`](https://github.com/trln/data-documentation/blob/master/marc/unmapped_marc_subfields.json).

**Jump to:**
* [Unmapped MARC fields - decisions](#unmapped-field-decisions)
* [Unmapped MARC fields - to decide/document](#unmapped-fields-to-decide)
* [Unmapped MARC subfields - decisions](#unmapped-subfield-decisions)
* [Unmapped MARC subfields - to decide/document](#unmapped-subfields-to-decide)

## Use the current MARC specification
First, run the script that generates a JSON hash of the current MARC specification. If it succeeds:
* The cell output is `true`
* `marc_definition.json` is written to the `data-documentation/marc` folder

In [1]:
require './extract_marc_bib_spec'

true

## Get all our mappings data
The next cell reads in the data we'll need:
* `marc_spec` is the updated MARC specification created in the previous step
* `unmapped_fields` holds our explicit decisions about MARC fields not mapped to Argot
* `unmapped_subfields` holds our explicit decisions about individual MARC subfields not mapped to Argot. These subfields all occur in MARC fields that are otherwise mapped to Argot.

In [2]:
require 'json'
marc_spec = JSON.parse(File.read('marc_definition.json'))
unmapped_fields = JSON.parse(File.read('unmapped_marc_tags.json'))
unmapped_subfields = JSON.parse(File.read('unmapped_marc_subfields.json'))
puts ''
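
For orientation, the shape of these files can be inferred from how the later cells index into them. The following is a minimal sketch using made-up stand-in JSON: the `SAMPLE_MARC_SPEC` and `SAMPLE_UNMAPPED_FIELDS` constants and the `field_name` helper are hypothetical and not part of the repository. Only the key names (`name`, `context`, `subfields`, `omission_type`, `omission_reason_category`, `omission_details`) come from the notebook code.

```ruby
require 'json'

# Hypothetical stand-in for marc_definition.json. Key names are inferred
# from how later cells read marc_spec; the values are illustrative only.
SAMPLE_MARC_SPEC = JSON.parse(<<~JSON)
  {
    "018": {
      "name": "COPYRIGHT ARTICLE-FEE CODE",
      "context": "",
      "subfields": {
        "a": { "name": "Copyright article-fee code", "context": "" }
      }
    }
  }
JSON

# Hypothetical stand-in for unmapped_marc_tags.json.
SAMPLE_UNMAPPED_FIELDS = JSON.parse(<<~JSON)
  {
    "018": {
      "omission_type": "permanent",
      "omission_reason_category": "limited public use",
      "omission_details": "Unlikely to assist with discovery."
    }
  }
JSON

# Look up a field name, returning nil for a tag that is missing from the
# spec instead of raising NoMethodError on nil.
def field_name(spec, tag)
  spec.dig(tag, 'name')
end

puts field_name(SAMPLE_MARC_SPEC, '018')
puts SAMPLE_UNMAPPED_FIELDS['018']['omission_type']
```

Using `dig` here is a defensive choice: a tag present in a decisions file but absent from the generated specification then shows up as a blank name rather than stopping the cell.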




Next, read our current MARC-to-Argot mappings from the Argot spreadsheet and turn them into a hash we can use:

In [3]:
require 'simple_xlsx_reader'
doc = SimpleXlsxReader.open('../argot/argot.xlsx')
mappings_d = doc.sheets[1].data
mappings_h = doc.sheets[1].headers
marc_only = mappings_d.select{ |r| r[2] == "MARC"}.map{ |r| [r[5].rjust(3,'0'), r[6], r[3]] }
# Remove mappings for fixed fields and those where data is a constant
marc_only.reject!{ |r| r[0] =~ /(00[678]|\{na\})/}
# Change {na}, {varies}, period, space, or parentheses in subfield list to nothing
marc_only.map!{ |r| [r[0], r[1].gsub(/\{(na|varies)\}|[.() ]/, ''), r[2]]}
# Convert subfield list to array
marc_only.map!{ |r| [r[0], r[1].split(''), r[2]]}
# Explode the mappings list so there's one array per subfield value
# orig: ["111", ["j", "4"], "n"]
# new: ["111", "j", "n"], ["111", "4", "n"]
flat_mappings = []
marc_only.each do |r|  
  r[1].each { |sf| flat_mappings << [r[0], sf, r[2]] }
end
mapping_hash = {}
# Use uniq rather than uniq!, which returns nil when there are no duplicates
flat_mappings.uniq.each do |r|
  if mapping_hash.has_key?(r[0])
    mapping_hash[r[0]][r[1]] = r[2]
  else
    mapping_hash[r[0]] = {r[1] => r[2]}
  end
end
puts ''




The above creates a hash with the following basic format: 

```
{
 "581": {
    "a": "n",
    "z": "n",
    "3": "n"
  },
  ...
}
```

Subfields a, z, and 3 from the 581 field are mapped to Argot. 

The "n" at the end of each subfield line means these are final mappings for production. "y" in this position indicates a provisional mapping, which may not yet be implemented or which is subject to major changes.
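
The clean-up and hash-building steps can be tried on a few rows without opening the spreadsheet. This is a minimal sketch, not the notebook's own code: the helper names `subfield_codes`, `build_mapping_hash`, and `provisional_subfields` are hypothetical, and the sample rows and their flags are invented. The regular expression is the one used in the cell above.

```ruby
# Strip {na}, {varies}, periods, parentheses, and spaces from a subfield
# list, then split what is left into single-character subfield codes.
def subfield_codes(raw)
  raw.to_s.gsub(/\{(na|varies)\}|[.() ]/, '').chars
end

# Expand [tag, subfield list, flag] rows into { tag => { code => flag } }.
# to_s guards against tags that arrive as integers (an assumption; the
# notebook calls rjust on the raw cell value).
def build_mapping_hash(rows)
  rows.each_with_object({}) do |(tag, raw, flag), hash|
    padded = tag.to_s.rjust(3, '0')
    subfield_codes(raw).each { |code| (hash[padded] ||= {})[code] = flag }
  end
end

# Return [tag, code] pairs whose mapping is flagged provisional ("y").
def provisional_subfields(mapping_hash)
  mapping_hash.flat_map do |tag, codes|
    codes.select { |_code, flag| flag == 'y' }.keys.map { |code| [tag, code] }
  end
end

# Invented sample rows; the flags do not reflect the real spreadsheet.
sample_rows = [
  ['581', 'a.z 3', 'n'],
  [15, 'a(z)', 'y'],
  ['245', '{varies}', 'n']
]
sample_hash = build_mapping_hash(sample_rows)
p sample_hash
p provisional_subfields(sample_hash)
```

Note that a row whose subfield list reduces to nothing (only `{na}` or `{varies}`) adds no entry for its tag. The notebook cell appears to behave the same way, so such a tag would later be treated as unmapped unless another row maps one of its subfields.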

## Unmapped MARC, field tag level
The following code block formats our unmapped field decisions for display:

In [4]:
unmapped_tag_table = { 'MARC tag' => [],
                       'Name' => [],
                       'Omission type' => [],
                       'Omission category' => [],
                       'Details' => []
                     }
unmapped_fields.sort.each do |tag, info|
  unmapped_tag_table['MARC tag'] << tag
  unmapped_tag_table['Name'] << marc_spec[tag]['name']
  unmapped_tag_table['Omission type'] << info['omission_type']
  unmapped_tag_table['Omission category'] << info['omission_reason_category']
  unmapped_tag_table['Details'] << info['omission_details']
end
puts ""
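
As an alternative to appending to five arrays in parallel, the same column-oriented hash could be derived from one row hash per tag. The sketch below is only an illustration under that assumption: `unmapped_tag_rows` and `columns_from_rows` are hypothetical helpers, and the sample data is a small invented subset.

```ruby
# Build one display row per unmapped tag. dig returns nil for a tag that is
# missing from the spec instead of raising NoMethodError.
def unmapped_tag_rows(unmapped_fields, marc_spec)
  unmapped_fields.sort.map do |tag, info|
    {
      'MARC tag' => tag,
      'Name' => marc_spec.dig(tag, 'name'),
      'Omission type' => info['omission_type'],
      'Omission category' => info['omission_reason_category'],
      'Details' => info['omission_details']
    }
  end
end

# Pivot row hashes into { header => [column values] }, the shape the
# notebook passes to IRuby.table.
def columns_from_rows(rows)
  return {} if rows.empty?

  rows.first.keys.to_h { |header| [header, rows.map { |row| row[header] }] }
end

# Invented sample data; tag 018 is deliberately absent from the spec.
sample_spec = { '025' => { 'name' => 'OVERSEAS ACQUISITION NUMBER' } }
sample_decisions = {
  '025' => { 'omission_type' => 'permanent',
             'omission_reason_category' => 'internal',
             'omission_details' => 'Internal to Library of Congress.' },
  '018' => { 'omission_type' => 'permanent',
             'omission_reason_category' => 'limited public use',
             'omission_details' => 'Cryptic.' }
}
p columns_from_rows(unmapped_tag_rows(sample_decisions, sample_spec))
```

Keeping each row together makes it harder for the columns to drift out of step if one value is skipped, at the cost of an extra pivot step.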




### Explicit decisions and justifications <a class="anchor" id="unmapped-field-decisions"></a>
The block below displays the fields we've explicitly decided not to map.

In [5]:
IRuby.display IRuby.table(unmapped_tag_table, maxrows: 200)

MARC tag,Name,Omission type,Omission category,Details
013,PATENT CONTROL INFORMATION,temporary,infrequently used,60 instances in UNC catalog on 2019-06-14. Most look like miscoded fields. Also of limited public use?
016,NATIONAL BIBLIOGRAPHIC AGENCY CONTROL NUMBER,temporary,low priority,"Over 244,000 instances in UNC catalog on 2019-06-14. Could map to misc_id, but this was initially considered a low priority."
017,COPYRIGHT OR LEGAL DEPOSIT NUMBER,temporary,infrequently used,"154 instances in UNC catalog as of 2019-06-14. At least in our catalog, users can't expect to find this information consistently. Could map to misc_id if necessary."
018,COPYRIGHT ARTICLE-FEE CODE,permanent,limited public use,Unique identification code for component parts appearing in monographs or continuing resources. Cryptic. Unlikely to assist with discovery.
025,OVERSEAS ACQUISITION NUMBER,permanent,internal,Internal to Library of Congress. Means nothing for our institutions.
026,FINGERPRINT IDENTIFIER,permanent,internal,"Used to assist in the identification of antiquarian books by recording information comprising groups of characters taken from specified positions on specified pages of the book. Example: 026 ##$ae-s- 11as$bs,me crth 3$c1797.$dv.1"
031,MUSICAL INCIPITS INFORMATION,temporary,infrequently used,"Only 5 instances of field in UNC catalog on 2019-06-14. Very cryptic and would take a lot of work to transform into something useful to end user. Could be useful in an eventual specialized music discovery tool, but this information might be coded elsewhere in the record in more usable ways."
032,POSTAL REGISTRATION NUMBER,permanent,internal,Number assigned to a publication for which the specified postal service permits the use of a special mailing class privilege.
033,DATE/TIME AND PLACE OF AN EVENT,temporary,limited public use,"15,564 occurrences in UNC catalog on 2019-06-14. Could be useful, but this isn't really a chronological *subject* and it's not a publication, etc. date either. Would need to make decisions about whether this is useful for discovery and how to best use it."
034,CODED CARTOGRAPHIC MATHEMATICAL DATA,temporary,limited public use,"56,450 occurrences in UNC catalog on 2019-06-14. Could be useful in specialized cartographic discovery tool, but doesn't contribute much in the context of our general discovery application. Cryptic, not easily translated into user-sensible display. Information is usually recorded in less structured, more user-readable way elsewhere in record."



The following prepares a list of all other **non-obsolete, non-local** unmapped MARC variable fields we haven't made an explicit decision about:

In [6]:
unmapped = {}
marc_spec.each do |tag, info|
  if mapping_hash.has_key?(tag)
    next
  elsif unmapped_fields.has_key?(tag)
    next
  elsif tag =~ /00[356789]/
    next
  elsif info['context'] =~ /OBSOLETE|LOCAL/
    next
  else
    unmapped[tag] = {'name' => info['name'].strip,
                     'context' => info['context']
      }
  end
end

table2 = { 'MARC tag' => [],
           'Description' => [],
           'Note' => []
         }
unmapped.sort.each do |tag, info|
  table2['MARC tag'] << tag
  table2['Description'] << info['name']
  table2['Note'] << info['context']
end
puts ''
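
The chain of conditions above amounts to "every tag in the specification, minus mapped tags, minus tags with a recorded decision, minus certain control fields, minus obsolete or local definitions." A compact way to express that is sketched below. The `undecided_tags` helper, the two constants, and the sample data (including the `context` values) are hypothetical; the tag pattern is anchored here, which is a small tightening of the unanchored pattern in the cell.

```ruby
# Control-field tags the notebook skips, anchored to the whole tag.
CONTROL_FIELD_TAG = /\A00[356789]\z/
# Context markers for definitions that should not be reported.
SKIPPED_CONTEXT = /OBSOLETE|LOCAL/

# Return the sorted tags that are neither mapped nor covered by a recorded
# decision, ignoring skipped control fields and obsolete/local definitions.
def undecided_tags(marc_spec, mapped_tags, decided_tags)
  remaining = marc_spec.reject do |tag, info|
    mapped_tags.include?(tag) ||
      decided_tags.include?(tag) ||
      tag.match?(CONTROL_FIELD_TAG) ||
      info['context'].to_s.match?(SKIPPED_CONTEXT)
  end
  remaining.keys.sort
end

# Invented sample specification; context values are illustrative only.
sample_spec = {
  '008' => { 'name' => 'FIXED-LENGTH DATA ELEMENTS', 'context' => '' },
  '013' => { 'name' => 'PATENT CONTROL INFORMATION', 'context' => '' },
  '050' => { 'name' => 'LIBRARY OF CONGRESS CALL NUMBER', 'context' => '' },
  '090' => { 'name' => 'EXAMPLE LOCAL FIELD', 'context' => 'LOCAL' },
  '245' => { 'name' => 'TITLE STATEMENT', 'context' => '' }
}
p undecided_tags(sample_spec, ['245'], ['013'])
```

In the notebook, `mapping_hash.keys` and `unmapped_fields.keys` would play the roles of `mapped_tags` and `decided_tags`.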




Number of unmapped MARC tags without a recorded decision:

In [7]:
puts table2['MARC tag'].size

55


### Unmapped MARC fields without recorded decision and/or justification <a class="anchor" id="unmapped-fields-to-decide"></a>

In [8]:
IRuby.display IRuby.table(table2, maxrows: 100)

MARC tag,Description,Note
050,LIBRARY OF CONGRESS CALL NUMBER,
051,"LIBRARY OF CONGRESS COPY, ISSUE, OFFPRINT STATEMENT",
052,GEOGRAPHIC CLASSIFICATION,
060,NATIONAL LIBRARY OF MEDICINE CALL NUMBER,
061,NATIONAL LIBRARY OF MEDICINE COPY STATEMENT,
082,DEWEY DECIMAL CLASSIFICATION NUMBER,
083,ADDITIONAL DEWEY DECIMAL CLASSIFICATION NUMBER,
085,SYNTHESIZED CLASSIFICATION NUMBER COMPONENTS,
086,GOVERNMENT DOCUMENT CLASSIFICATION NUMBER,
242,TRANSLATION OF TITLE BY CATALOGING AGENCY,



## Unmapped MARC, subfield level
Here, we are looking at subfields in MARC fields that are at least partly mapped to Argot. 
What subfields in such partially mapped fields are **not** mapped to Argot? 

The code below prepares our subfield exclusion decisions for display:

In [9]:
unmapped_sf_table = { 'MARC tag' => [],
                      'Subfield code' => [],
                      'Field name' => [],
                      'Subfield name' => [],
                      'Omission type' => [],
                      'Omission category' => [],
                      'Details' => []
                    }
unmapped_subfields.sort.each do |tag, sfs|
  sfs.each do |code, info|
    unmapped_sf_table['MARC tag'] << tag
    unmapped_sf_table['Subfield code'] << code
    unmapped_sf_table['Field name'] << marc_spec[tag]['name']
    unmapped_sf_table['Subfield name'] << marc_spec[tag]['subfields'][code]['name']
    unmapped_sf_table['Omission type'] << info['omission_type']
    unmapped_sf_table['Omission category'] << info['omission_reason_category']
    unmapped_sf_table['Details'] << info['omission_details']
  end
end
puts ''

### Explicit subfield exclusion decisions <a class="anchor" id="unmapped-subfield-decisions"></a>

In [10]:
IRuby.display IRuby.table(unmapped_sf_table, maxrows: 100)

MARC tag,Subfield code,Field name,Subfield name,Omission type,Omission category,Details
020,c,INTERNATIONAL STANDARD BOOK NUMBER,Terms of availability,permanent,data quality/currency; misleading to public,Terms of availability change rapidly. We do not keep this field updated. Do not mislead patrons about how much a thing costs.
022,m,INTERNATIONAL STANDARD SERIAL NUMBER,Canceled ISSN-L,,,
022,y,INTERNATIONAL STANDARD SERIAL NUMBER,Incorrect ISSN,,,
022,z,INTERNATIONAL STANDARD SERIAL NUMBER,Canceled ISSN,,,
024,c,OTHER STANDARD IDENTIFIER,Terms of availability,permanent,data quality/currency; misleading to public,Terms of availability change rapidly. We do not keep this field updated. Do not mislead patrons about how much a thing costs.
044,b,COUNTRY OF PUBLISHING/PRODUCING ENTITY CODE,Local subentity code,temporary,low priority,UNC has no data in this subfield. Consider mapping to origin_place fields if $2 value is set and use of subfield increases
044,c,COUNTRY OF PUBLISHING/PRODUCING ENTITY CODE,ISO country code,temporary,low priority,UNC has 835 instances of 044s with this subfield on 2019-06-14. Many of them recode the country coded in the $a. Consider mapping to origin_place fields if amount of unique data in this subfield increases.
567,b,METHODOLOGY NOTE,Controlled term,temporary,low priority,UNC has no data in this subfield. Consider mapping to genre_headings and subject_genre if ever meaningfully populated



### Unmapped subfields from MARC specification
These subfields may have been overlooked. If an explicit decision has been made to exclude a subfield from mapping to Argot, that decision should be documented in https://github.com/trln/data-documentation/blob/master/marc/unmapped_marc_subfields.json

First we flatten out all subfield data so it's easier to work with:

In [11]:
subfields = []
marc_spec.each do |tag, info|
  if info['subfields'].size > 0
    info['subfields'].each do |code, sfinfo|
      subfields << {'tag' => tag,
        'code' => code,
        'name' => "#{info['name']} - #{sfinfo['name']}",
        'context' => sfinfo['context']}      
    end
  end
end

puts "Number of subfields: #{subfields.size}"

Number of subfields: 2533


Then, get rid of subfields that are in unmapped MARC fields. Keep only those whose field tag appears in our mappings spreadsheet.

In [12]:
subfields.select!{ |sf| mapping_hash.has_key?(sf['tag']) }
puts "Number of subfields: #{subfields.size}"

Number of subfields: 1554


Then, remove:
* Obsolete or local subfields
* Subfield 0 where subfield name = Authority record control number or standard number
* Subfield 1 where subfield name = Real World Object URI
* Subfield 2 where subfield name starts with Source
* Subfield 6 where subfield name = Linkage
* Subfield 7 where subfield name = Control subfield
* Subfield 8 where subfield name = Field link and sequence number

In [13]:
subfields.reject!{ |sf| sf['code'] == '0' && sf['name'] =~ / - Authority record control number/  }
subfields.reject!{ |sf| sf['code'] == '1' && sf['name'] =~ / - Real World Object URI/ }
subfields.reject!{ |sf| sf['code'] == '2' && sf['name'] =~ / - Source/ }
subfields.reject!{ |sf| sf['code'] == '6' && sf['name'] =~ / - Linkage/ }
subfields.reject!{ |sf| sf['code'] == '7' && sf['name'] =~ / - Control subfield/ }
subfields.reject!{ |sf| sf['code'] == '8' && sf['name'] =~ / - Field link and sequence number/ }
subfields.reject!{ |sf| sf['context'] =~ /OBSOLETE|LOCAL/ }
puts "Number of subfields: #{subfields.size}"

Number of subfields: 1129
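
The seven `reject!` calls share one pattern: a subfield code paired with the start of a subfield name, plus a context check. One way to keep those rules in a single place is a lookup table, sketched here. The `CONTROL_SUBFIELD_RULES` constant and `skippable_subfield?` helper are hypothetical names; the patterns are copied from the cell above, and the sample subfields (including the `context` value) are invented.

```ruby
# Subfield code => pattern its combined "FIELD - Subfield" name must match.
CONTROL_SUBFIELD_RULES = {
  '0' => / - Authority record control number/,
  '1' => / - Real World Object URI/,
  '2' => / - Source/,
  '6' => / - Linkage/,
  '7' => / - Control subfield/,
  '8' => / - Field link and sequence number/
}.freeze

# True when a flattened subfield hash should be dropped from the analysis:
# either its context marks it obsolete or local, or its code and name match
# one of the control-subfield rules.
def skippable_subfield?(subfield)
  return true if subfield['context'].to_s.match?(/OBSOLETE|LOCAL/)

  rule = CONTROL_SUBFIELD_RULES[subfield['code']]
  !rule.nil? && subfield['name'].to_s.match?(rule)
end

# Invented sample entries in the same shape as the flattened subfields.
sample = [
  { 'tag' => '100', 'code' => '0', 'context' => '',
    'name' => 'MAIN ENTRY - Authority record control number or standard number' },
  { 'tag' => '041', 'code' => 'b', 'context' => '',
    'name' => 'LANGUAGE CODE - Language code of summary or abstract' },
  { 'tag' => '999', 'code' => 'x', 'context' => 'LOCAL',
    'name' => 'EXAMPLE FIELD - Example local subfield' }
]
p sample.reject { |sf| skippable_subfield?(sf) }.map { |sf| sf['tag'] }
```

With this helper, the cell could be reduced to a single `subfields.reject! { |sf| skippable_subfield?(sf) }`, and adding a new rule becomes a one-line change to the table.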


Then, remove subfields we have mapped in Argot spreadsheet:

In [14]:
subfields.reject!{ |sf| mapping_hash.has_key?(sf['tag']) && mapping_hash[sf['tag']].has_key?(sf['code']) }
puts "Number of subfields: #{subfields.size}"

Number of subfields: 155
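
Some subfields that survive this step already have entries in `unmapped_marc_subfields.json` (020$c and 024$c in the output further down, for example), while others such as the 022 subfields have entries with no decision filled in. A possible refinement, which the notebook itself does not perform, is to split the remainder into subfields with a recorded decision and subfields still needing one. The helper names below are hypothetical, and treating an empty `omission_type` as "no decision yet" is an assumption.

```ruby
# True when the decisions hash has an entry for this subfield with a
# non-empty omission_type.
def decision_recorded?(unmapped_subfields, tag, code)
  !unmapped_subfields.dig(tag, code, 'omission_type').to_s.empty?
end

# Split flattened subfields into [documented, undocumented] lists.
def split_by_decision(subfields, unmapped_subfields)
  subfields.partition do |sf|
    decision_recorded?(unmapped_subfields, sf['tag'], sf['code'])
  end
end

# Small sample modeled on values shown in this notebook's output.
sample_decisions = {
  '020' => { 'c' => { 'omission_type' => 'permanent' } },
  '022' => { 'm' => { 'omission_type' => nil } }
}
sample_remaining = [
  { 'tag' => '015', 'code' => 'z' },
  { 'tag' => '020', 'code' => 'c' },
  { 'tag' => '022', 'code' => 'm' }
]
documented, undocumented = split_by_decision(sample_remaining, sample_decisions)
puts "Documented: #{documented.map { |sf| sf['tag'] + '$' + sf['code'] }.join(', ')}"
puts "To decide: #{undocumented.map { |sf| sf['tag'] + '$' + sf['code'] }.join(', ')}"
```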


Then, prepare the remaining unmapped subfields to be output as a table for analysis:

In [15]:
table4 = { 'MARC tag' => [],
  'subfield code' => [],
  'name' => []
  }

subfields.each do |sf|
  table4['MARC tag'] << sf['tag']
  table4['subfield code'] << sf['code']
  table4['name'] << sf['name']
end
puts ''

### Unmapped subfields to decide on/document <a class="anchor" id="unmapped-subfields-to-decide"></a>

In [16]:
IRuby.display IRuby.table(table4, maxrows: 200)

MARC tag,subfield code,name
015,z,NATIONAL BIBLIOGRAPHY NUMBER - Canceled/Invalid national bibliography number
020,c,INTERNATIONAL STANDARD BOOK NUMBER - Terms of availability
022,m,INTERNATIONAL STANDARD SERIAL NUMBER - Canceled ISSN-L
022,y,INTERNATIONAL STANDARD SERIAL NUMBER - Incorrect ISSN
022,z,INTERNATIONAL STANDARD SERIAL NUMBER - Canceled ISSN
024,c,OTHER STANDARD IDENTIFIER - Terms of availability
028,a,PUBLISHER NUMBER OR DISTRIBUTOR NUMBER - Publisher or distributor number
035,z,SYSTEM CONTROL NUMBER - Canceled/invalid control number
041,b,LANGUAGE CODE - Language code of summary or abstract
041,f,LANGUAGE CODE - Language code of table of contents

