# Using `pymarc` on Roger records

We will use the Python library `pymarc` on Roger records to investigate whether it offers a quicker way to go from Roger to DAMS. For the MARC source data, I'm using the May 2018 batch MARC file from Roger, which contains hundreds of records. That is probably much larger than the average digital collection, but it still serves as a good test data source.

Note: if this becomes a script, the ideal scenario would be to have a way to get MARC records directly from Roger.

## Step 1: Reading in MARC records and exploring the data

This step follows the [GitHub tutorial on reading](https://github.com/edsu/pymarc#reading).

In [12]:
from pymarc import MARCReader

titles = []
with open('data/D180501.mrk', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        titles.append(record['245']['a']) # This will only capture 245 subfield a

In [13]:
print(titles[0:10])

['El nuevo indio :', 'California design', 'Best of college photography annual', 'Contemporary Japanese prints', 'Transmission, speaking & listening', 'Con sabor a España /', 'Super quad', 'Fiesta bailable U.S.A', 'Insieme', 'Romance y cuerdas /']


Many titles end with a trailing `/` or `:` where the subfield ends. Let's strip those off with `rsplit`.

In [21]:
with open('data/D180501.mrk', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        # Get title (245 |a); fall back to the string 'None' when it is missing
        if record['245'] is not None and record['245']['a'] is not None:
            # Drop everything from the last '/', then everything from the last ':'
            title = record['245']['a'].rsplit('/', 1)[0]
            title = title.rsplit(':', 1)[0]
        else:
            title = 'None'
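
The `rsplit` calls above cut at the last `/` or `:` wherever it occurs, so a slash or colon inside a title would also truncate it. A regular expression anchored to the end of the string is one possible alternative. The sketch below is an illustration only: `strip_trailing_punct` is a hypothetical helper, and including `;` and `=` in the pattern is an assumption (only ` /` and ` :` appear in the sample output).

```python
import re

# Trailing whitespace and punctuation at the very end of the string.
# ' /' and ' :' appear in the sample titles; ';' and '=' are an assumption.
TRAILING_PUNCT = re.compile(r'[\s/:;=]+$')


def strip_trailing_punct(text):
    """Remove punctuation at the end of the string only; leave the rest intact."""
    return TRAILING_PUNCT.sub('', text)


# Titles copied from the output of the first cell
samples = ['El nuevo indio :', 'Con sabor a España /', 'California design']
cleaned = [strip_trailing_punct(s) for s in samples]
print(cleaned)
```

Because the pattern is anchored with `$`, a `/` or `:` in the middle of a title is left alone.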

We're only getting subfield `a` so far. We'd want to get the subtitles as well. Luckily, `pymarc` has a method for this, simply called `title`. Let's use it with the same `rsplit` cleanup as above, and also strip the trailing whitespace.

In [41]:
titles = []

with open('data/D180501.mrk', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        # Get title (245 |a and |b)
        if record.title() is not None:
            title = record.title()
            title = title.rsplit('/', 1)[0]  # drop everything from the last '/'
            title = title.rsplit(':', 1)[0]  # caution: also drops a subtitle after the last ':'
            title = title.strip()
            titles.append(title)
        # Records without a title are skipped

In [42]:
print(titles[0:100])

['El nuevo indio', 'California design', 'Best of college photography annual', 'Contemporary Japanese prints', 'Transmission, speaking & listening', 'Con sabor a España', 'Super quad', 'Fiesta bailable U.S.A', 'Insieme', 'Romance y cuerdas', 'The best of Roll Records', 'Pedagogy of hope', 'Cancer risk assessment', "Crain's small business", 'Energy security', 'Time series', 'Africa after gender?', 'Pay', 'Making sense of public opinion', 'Testosterone', 'Structural equation modeling', "China's search for security", 'Classical algebraic geometry', 'Electoral systems and political context', 'Occupy money', 'The physics of dilute magnetic alloys', 'Reparations for Nazi victims in postwar Europe', 'Idols and celebrity in Japanese media culture', 'Mapping the Chinese and Islamic worlds', 'Ontology revisited', 'The Politics and Ethics of Identity', 'Aristotle on desire', 'Democracy at large', 'Frege on absolute and relative truth', 'Frege', 'Poverty and sickness in modern Europe', 'Research me
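
One thing to watch in the output above: the first title is still just 'El nuevo indio'. Splitting on the last `:` removes whatever follows it, which in a 245 field is usually the subtitle from subfield `b`. A minimal sketch of one way to keep the subtitle is shown below. It works on plain strings, so `join_title` is a hypothetical helper and not part of `pymarc`; in a script, its arguments would presumably come from `record['245']['a']` and `record['245']['b']`.

```python
import re


def join_title(a, b=None):
    """Join 245 |a and |b, removing only the punctuation at the very end."""
    parts = [p.strip() for p in (a, b) if p]
    text = ' '.join(parts)
    return re.sub(r'[\s/:]+$', '', text)


# Subfield values taken from the example in the pymarc Field docstring
full = join_title('The pragmatic programmer : ', 'from journeyman to master /')
print(full)

# For comparison, the rsplit approach applied to the same joined string
joined = 'The pragmatic programmer : from journeyman to master /'
short = joined.rsplit('/', 1)[0].rsplit(':', 1)[0].strip()
print(short)
```

The first result keeps the subtitle after the colon, while the second stops at the main title.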

Next, let's check which methods the `Record` class provides.

In [38]:
import pymarc
help(pymarc.Record)

Help on class Record in module pymarc.record:

class Record(builtins.object)
 |  A class for representing a MARC record. Each Record object is made up of
 |  multiple Field objects. You'll probably want to look at the docs for Field
 |  to see how to fully use a Record object.
 |  
 |  Basic usage:
 |  
 |      field = Field(
 |          tag = '245',
 |          indicators = ['0','1'],
 |          subfields = [
 |              'a', 'The pragmatic programmer : ',
 |              'b', 'from journeyman to master /',
 |              'c', 'Andrew Hunt, David Thomas.',
 |          ])
 |  
 |      record.add_field(field)
 |  
 |  Or creating a record from a chunk of MARC in transmission format:
 |  
 |      record = Record(data=chunk)
 |  
 |  Or getting a record as serialized MARC21.
 |  
 |      raw = record.as_marc()
 |  
 |  You'll normally want to use a MARCReader object to iterate through
 |  MARC records in a file.
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, tag)
 |     

In [61]:
with open('data/D180501.mrk', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        # Get subjects (a list of Field objects) and print each one as text
        for s in record.subjects():
            print(s.format_field())

Indians of South America -- Peru
Peru -- Civilization
Decorative arts -- California -- Exhibitions -- Periodicals
Decorative arts -- Exhibitions. fast (OCoLC)fst00889326
California. fast (OCoLC)fst01204928
Exhibition catalogs. fast (OCoLC)fst01424028
Periodicals. fast (OCoLC)fst01411641
Exhibition catalogs. lcgft
Periodicals. lcgft
Photography -- Competitions -- Periodicals
Photography -- Competitions. fast (OCoLC)fst01061726
Periodicals. fast (OCoLC)fst01411641
1900-1999 fast
Prints, Japanese -- 20th century -- Catalogs
Prints, Japanese. fast (OCoLC)fst01076922
Catalogs. fast (OCoLC)fst01423692
2000-2099 fast
Artists -- Interviews -- Periodicals
Art, Modern -- 21st century -- Periodicals
Arts, Modern -- 21st century -- Periodicals
Art, Modern. fast (OCoLC)fst00816615
Artists. fast (OCoLC)fst00817559
Arts, Modern. fast (OCoLC)fst00818137
Interviews. fast (OCoLC)fst01423832
Periodicals. fast (OCoLC)fst01411641
Popular music
Popular music. lcgft
Rap (Music)
Popular music -- 1991-2000
Hip
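
Many of the lines above carry a trailing source label (`fast` with an OCLC identifier, or `lcgft`), and the same heading often repeats across vocabularies. If only the printed strings are available, a rough way to separate the heading from its source is sketched below. It assumes the only labels are the two seen in this output, and `split_subject` is a hypothetical helper. Reading subfield `2` (source of heading) from the `Field` itself would likely be more robust than parsing text.

```python
import re

# Matches the trailing source markers seen in the output, e.g.
# 'Periodicals. fast (OCoLC)fst01411641', 'Periodicals. lcgft', '1900-1999 fast'.
# Only 'fast' and 'lcgft' are handled; other vocabularies are an open assumption.
SOURCE_RE = re.compile(
    r'^(?P<heading>.*?)\.?\s+(?P<source>fast|lcgft)(?:\s+\(OCoLC\)fst\d+)?$'
)


def split_subject(text):
    """Return (heading, source); source is None when no marker is present."""
    match = SOURCE_RE.match(text)
    if match:
        return match.group('heading'), match.group('source')
    return text, None


# Lines copied from the subject output
lines = [
    'Indians of South America -- Peru',
    'Decorative arts -- Exhibitions. fast (OCoLC)fst00889326',
    'Exhibition catalogs. lcgft',
    '1900-1999 fast',
]

# Group the headings by their source label
by_source = {}
for heading, source in (split_subject(s) for s in lines):
    by_source.setdefault(source, []).append(heading)

print(by_source)
```

Grouping by source makes it easier to decide later which vocabulary, if any, should be carried over to the DAMS.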

`record.subjects()` returns `Field` objects. Let's see what we can do with them.

In [51]:
help(pymarc.Field)

Help on class Field in module pymarc.field:

class Field(builtins.object)
 |  Field() pass in the field tag, indicators and subfields for the tag.
 |  
 |      field = Field(
 |          tag = '245',
 |          indicators = ['0','1'],
 |          subfields = [
 |              'a', 'The pragmatic programmer : ',
 |              'b', 'from journeyman to master /',
 |              'c', 'Andrew Hunt, David Thomas.',
 |          ])
 |  
 |  If you want to create a control field, don't pass in the indicators
 |  and use a data parameter rather than a subfields parameter:
 |  
 |      field = Field(tag='001', data='fol05731351')
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, subfield)
 |      Allows a shorthand test of field membership:
 |      
 |          'a' in field
 |  
 |  __getitem__(self, subfield)
 |      Retrieve the first subfield with a given subfield code in a field:
 |      
 |          field['a']
 |      
 |      Handy for quick lookups.
 |  
 |  __init__(self, tag