<a href="https://colab.research.google.com/github/ysugiyama3/google_colab/blob/master/microfilm_boundwiths_cleanup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Duplicate Barcode Cleanup for Microfilm Boundwiths**


When item records are migrated to Alma, a barcode that exists but is not unique is still migrated, but it is disambiguated by appending the internal item_id to the barcode: [item_barcode]-[item_id]. For more information, see the [Voyager to Alma Migration Guide](https://knowledge.exlibrisgroup.com/Alma/Implementation_and_Migration/Migration_Guides_and_Tutorials/Voyager_to_Alma_Migration_Guide#Item_Barcodes).
We have about 18,900 items with duplicate barcodes. Microfilms account for 80% of them, and some of those can be cleaned up with this program by following [the Bound-With procedures for serials and multiparts](https://web.library.yale.edu/cataloging/bound-procedures-serials-multiparts).
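
To make the `[item_barcode]-[item_id]` format concrete, here is a minimal sketch in plain Python. It is not Alma's migration code. The function name and the sample IDs are made up for illustration, and it assumes every item sharing a barcode receives the suffix, which the text above does not spell out.

```python
from collections import Counter


def disambiguate_barcodes(items):
    """Append the item_id to every barcode that occurs more than once.

    `items` is a list of (item_id, barcode) tuples. Illustration only:
    whether the first occurrence is also suffixed is an assumption here.
    """
    counts = Counter(barcode for _, barcode in items)
    return [
        f"{barcode}-{item_id}" if counts[barcode] > 1 else barcode
        for item_id, barcode in items
    ]


# Made-up sample data: two items share one barcode, a third is unique.
items = [
    ("1001", "39002000000001"),
    ("1002", "39002000000001"),
    ("1003", "39002000000002"),
]
migrated = disambiguate_barcodes(items)
print(migrated)
```

Cleaning up the duplicates before migration avoids these suffixed barcodes altogether.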

By yukari.sugiyama@yale.edu

---


What you need:
* A base Excel spreadsheet of duplicate item records that can be batch processed. Please exclude item records that were processed manually, so that every remaining row has Column O (Method) = 'Batch'
* MarcExtract (\TS_Local\MarcExtract)
* [RecordReloader](https://files.library.northwestern.edu/public/RecordReloader/)
* Pick and Scan (Voyager Cataloging module)
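
Before uploading, it can help to confirm that the spreadsheet really contains only batch rows. The sketch below models rows as plain dictionaries so that it runs without extra packages. The `Item ID` key and the sample values are hypothetical, and only the `Method` column name comes from the list above. With pandas, the equivalent filter would be `df[df['Method'] == 'Batch']`.

```python
def batch_only(rows, column="Method"):
    """Keep only rows whose Method cell is 'Batch' (surrounding spaces ignored)."""
    return [row for row in rows if str(row.get(column, "")).strip() == "Batch"]


# Hypothetical rows standing in for spreadsheet lines.
rows = [
    {"Item ID": "1001", "Method": "Batch"},
    {"Item ID": "1002", "Method": "Manual"},
    {"Item ID": "1003", "Method": "Batch "},
]
batch_rows = batch_only(rows)
batch_ids = [row["Item ID"] for row in batch_rows]
print(batch_ids)
```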


The program generates the following outputs:
* **item_to_delete.txt** that is used for item record deletion in Voyager's Pick and Scan tool
* **mfhdid_list.txt** that is used in the MarcExtract tool to extract specific MFHD MARC records
* **mfhd_output.mrc** that is used to update MFHD records in Gary Strawn's RecordReloader tool
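
How these outputs relate to the spreadsheet rows can be sketched on toy data. The code below mirrors the keep/delete logic of step 1 using plain dictionaries. The key names (`item_id`, `barcode`, `mfhd_id`, `bib_id`, `comment`) and all sample values are hypothetical labels, since the notebook itself reads the columns by position.

```python
def plan_cleanup(rows):
    """Split rows into items to delete, MFHDs to edit, and each MFHD's host bib."""
    host_bib_by_barcode = {}
    items_to_delete = []
    mfhds_to_edit = []
    for row in rows:
        barcode = row["barcode"]
        # The first 'Keep' row for a barcode becomes the host record.
        if row["comment"] == "Keep" and barcode not in host_bib_by_barcode:
            host_bib_by_barcode[barcode] = row["bib_id"]
        else:
            # Every other row loses its item and has its MFHD edited.
            if row["mfhd_id"] not in mfhds_to_edit:
                mfhds_to_edit.append(row["mfhd_id"])
            if row["item_id"] not in items_to_delete:
                items_to_delete.append(row["item_id"])

    # Point each MFHD to edit at the host bib that shares its barcode.
    host_bib_by_mfhd = {}
    for row in rows:
        mfhd_id = row["mfhd_id"]
        if mfhd_id in mfhds_to_edit and mfhd_id not in host_bib_by_mfhd:
            host_bib_by_mfhd[mfhd_id] = host_bib_by_barcode.get(row["barcode"])
    return items_to_delete, mfhds_to_edit, host_bib_by_mfhd


# Made-up rows: three items share one barcode and the first is kept.
rows = [
    {"item_id": "1001", "barcode": "B1", "mfhd_id": "501", "bib_id": "9001", "comment": "Keep"},
    {"item_id": "1002", "barcode": "B1", "mfhd_id": "502", "bib_id": "9002", "comment": ""},
    {"item_id": "1003", "barcode": "B1", "mfhd_id": "503", "bib_id": "9003", "comment": ""},
]
items_to_delete, mfhds_to_edit, host_bib_by_mfhd = plan_cleanup(rows)

# item_to_delete.txt content: one item ID per line, each ending with CRLF.
payload = "\r\n".join(items_to_delete) + "\r\n"
print(items_to_delete, mfhds_to_edit, host_bib_by_mfhd)
```

In this toy run, items 1002 and 1003 are slated for deletion, and MFHDs 502 and 503 would each receive an 856 link pointing at host bib 9001.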


---


In [None]:
#@title 1. Upload an input Excel file
!pip install pymarc &> /dev/null

from google.colab import files
import pandas as pd
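import pymarc  # step 2 refers to pymarc.Field, so the module name must be bound
from pymarc import MARCReader, MARCWriter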
from IPython.display import HTML, display
import time
from datetime import date

# Upload an input Excel file
uploaded = files.upload()
input_name = str(list(uploaded.keys())[0])

# Read an input Excel file into a pandas DataFrame
input_df = pd.read_excel(input_name)

# Create dictionaries and lists
barcode_bib_dict = dict()
item_to_delete_list = list()
mfhdid_list = list()
mfhd_bib_dict = dict()

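# First pass: pick the host record for each barcode and collect everything else
for _, row in input_df.iterrows():
    # Columns are read by position; .iloc avoids the deprecated row[0] lookup
    itemid = str(row.iloc[0])
    barcode = str(row.iloc[3])
    mfhdid = str(row.iloc[5])
    bibid = str(row.iloc[9])
    comment = row.iloc[16]
    if comment == 'Keep' and barcode not in barcode_bib_dict:
        # The first 'Keep' row for a barcode supplies the host bib ID
        barcode_bib_dict[barcode] = bibid
    else:
        if mfhdid not in mfhdid_list:
            mfhdid_list.append(mfhdid)
        if itemid not in item_to_delete_list:
            item_to_delete_list.append(itemid)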

# Second pass: map each MFHD to edit to the host bib that shares its barcode
for _, row in input_df.iterrows():
    barcode = str(row.iloc[3])
    mfhdid = str(row.iloc[5])
    host_bibid = barcode_bib_dict.get(barcode)  # None if no 'Keep' row exists
    if mfhdid in mfhdid_list and mfhdid not in mfhd_bib_dict:
        mfhd_bib_dict[mfhdid] = host_bibid

# Create and write mfhdid_list.txt
mfhdid_list_output = 'mfhdid_list.txt'

with open(mfhdid_list_output, 'w') as m:
    m.write('\n'.join(mfhdid_list) + '\n')

# Create and write item_to_delete.txt
item_to_delete_output = 'item_to_delete.txt'

# newline='' writes the CRLF endings exactly as given on every platform
with open(item_to_delete_output, 'w', newline='') as i:
    i.write('\r\n'.join(item_to_delete_list) + '\r\n')

# Total number of holdings records to edit
mfhd_total = len(mfhdid_list)

# Total number of item records to delete
item_total = len(item_to_delete_list)

# Done. Print results
print('\n\nNumber of MFHD records to edit: ', mfhd_total)
print('Number of ITEM records to delete: ', item_total)
print('Download mfhdid_list.txt, use it in MarcExtract to obtain an MFHD MARC file, and then move on to the 2nd step')


In [None]:
#@title 2. Upload an input MFHD MARC file
uploaded_mrc = files.upload()
in_mrc_name = str(list(uploaded_mrc.keys())[0])
out_mrc_name = in_mrc_name[:-4] + "_output.mrc"
writer = MARCWriter(open(out_mrc_name, mode="wb"))

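from pymarc import Field, Subfield  # pymarc 5+ expects Subfield objects

def add_field_856():
    # Uses the globals `record` and `mfhd` set by the reading loop below
    bib = mfhd_bib_dict[mfhd]
    url = 'http://hdl.handle.net/10079/bibid/' + bib
    field_856 = Field(
        tag='856',
        indicators=['4', '2'],
        subfields=[
            Subfield(code='u', value=url),
            Subfield(code='z', value='Click here to request'),
        ],
    )
    record.add_ordered_field(field_856)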

def progress(value, max_value=50000):
    # Render an HTML progress bar with a value/max counter underneath
    return HTML("""
        <progress
            value='{value}'
            max='{max}'
            style='width: 40%'
        >
            {value}
        </progress>
        <br>{value}/{max}
    """.format(value=value, max=max_value))

count = 0
out = display(progress(0, mfhd_total), display_id=True)

with open(in_mrc_name, mode="rb") as fh:
    reader = MARCReader(fh) 
    # read each record
    for record in reader:
        count += 1
        out.update(progress(count, mfhd_total))
        if record is None:
            # pymarc yields None for a record it cannot parse; skip it
            continue
        mfhd = record['001'].value()
        if mfhd in mfhdid_list:
            add_field_856()
            writer.write(record)
            time.sleep(2)

writer.close()
print('\n\nDone! \U0001f44D \nPlease load ' + out_mrc_name + ' using RecordReloader')

In [None]:
#@title 3. Refresh program (optional)
from IPython.display import clear_output
from google.colab import runtime
clear_output()
runtime.unassign()
