# Data Mining the Internet Archive Collection

Link to PubPub: https://vivian-tran-dh.pubpub.org/pub/data-mining-the-internet-archive-dh/release/1

# Source

Caleb McDaniel, "Data Mining the Internet Archive Collection," Programming Historian 3 (2014), https://doi.org/10.46430/phen0035.

# Reflection

The Internet Archive (IA) is a vast playground for scholars, students, artists, and casual web-crawlers alike. It contains millions of freely accessible entries from across history, including images, videos, literature, and archived websites. In addition to browsing the IA's collections, users are free to upload their own content to the record.

As an artist, I have found the IA incredibly helpful to my own work surrounding archives. Because many entries are in the public domain, it has provided me with invaluable material for exploring the themes I'm interested in. I chose this lesson because I would love to acquire more skills for navigating the IA more efficiently.

In this lesson, we worked with the IA's records of the Boston Public Library's Anti-Slavery Collection. We used Python to download items from a specified collection in bulk, including their MARC records. MARC records contain rich metadata about each item, such as year and location of publication, which can be important for historians. We also used Python to gather and organize these records.

Because the collection we worked with is so large (my search returned 9,768 items), I downloaded only a small number of records for the purposes of the lesson. I have yet to learn any data visualization tools, so for now I have only a printed list of the information I gathered. I am unsure how to remove the excess colons and commas from the output, as the functions we used to generate this list don't offer much transparency.

# Accessing a Collection

In [6]:
import internetarchive

In [7]:
# Using the search_items function to see how many items are in the Boston Public Library's anti-slavery collection
search = internetarchive.search_items('collection:bplscas')
print(search.num_found)

9768


In [10]:
# Accessing individual items with identifiers
item = internetarchive.get_item('lettertowilliaml00doug')
item.download()

[]

In [12]:
# Looking at many items in a collection
search = internetarchive.search_items('collection:bplscas')

for result in search:
    print(result['identifier'])

100conventionsma00mays
16thnationalanti00chap
39999081179663Images
abolitionismrevi00lewi
abolitionist00newe_0
abolitionist00newe_1
abolitionist00newe_2
abolitionist1833newe
abolitionist1833newe10
abolitionist1833newe11
abolitionist1833newe12
abolitionist1833newe2
abolitionist1833newe3
abolitionist1833newe4
abolitionist1833newe5
abolitionist1833newe6
abolitionist1833newe7
abolitionist1833newe8
abolitionist1833newe9
abolitionistsnew00phel
abolitionofslave1862garr
abolitionrieties00jone
abstractofeviden00grea
abstractofreport00grea
accountbookoflib01mass
accountbookoflib02mass
accountofmeeting00mays
accountofslavetr00falc
accountrecordsma00mays
addressdelivered00cook
addressfromcommi00unse
addressofincorpo00inco
addressofliberty00libe
addressonslavery00madd
addressonwestind00will
addressreplyonpr00chas
addresstoaboliti1838mass
addresstochristi00lewi
addresstofriends01amer
addresstomembers00grea
addresstopeopleo01foxw
addresstopeopleo1791foxw
addresstopeopleo1792foxw
addresstopeopleo92fox

# Downloading Items From a Collection

In [23]:
# Downloading a specific file from an item; returns True if successful, otherwise produces an error
item = internetarchive.get_item('lettertowilliaml00doug')
marc = item.get_file('lettertowilliaml00doug_marc.xml')
marc.download()

True

In [21]:
# Writing a script that will download all the MARC records of each item in the BPL Antislavery Collection
import time

# Open the error log in append mode; the with block closes it even if the loop is interrupted
with open('bpl-marcs-errors.log', 'a') as error_log:
    search = internetarchive.search_items('collection:bplscas')

    for result in search:
        # Getting each item's ID from the collection
        itemid = result['identifier']
        item = internetarchive.get_item(itemid)

        # Using the ID to find the MARC record of each item
        marc = item.get_file(itemid + '_marc.xml')

        # Handling errors in downloads: continue with the remaining downloads instead of ending the loop if an error is encountered
        try:
            marc.download()

        # Log any files that were unable to be downloaded to an error log
        except Exception as e:
            error_log.write('Could not download ' + itemid + ' because of error: %s\n' % e)
            print("There was an error; writing to log.")
        else:
            print("Downloading " + itemid + " ...")

            # Pause for one second before proceeding, to not overload IA's servers
            time.sleep(1)

Downloading 100conventionsma00mays ...
Downloading 16thnationalanti00chap ...
There was an error; writing to log.
Downloading abolitionismrevi00lewi ...
Downloading abolitionist00newe_0 ...
Downloading abolitionist00newe_1 ...
Downloading abolitionist00newe_2 ...
Downloading abolitionist1833newe ...
Downloading abolitionist1833newe10 ...
Downloading abolitionist1833newe11 ...
Downloading abolitionist1833newe12 ...
Downloading abolitionist1833newe2 ...
Downloading abolitionist1833newe3 ...
Downloading abolitionist1833newe4 ...
Downloading abolitionist1833newe5 ...
Downloading abolitionist1833newe6 ...
Downloading abolitionist1833newe7 ...
Downloading abolitionist1833newe8 ...
Downloading abolitionist1833newe9 ...


KeyboardInterrupt: 
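
Interrupting the loop by hand works, but the number of downloads can also be capped up front. The following is a minimal sketch, assuming the search results can be iterated like any other Python iterable of dictionaries with an `'identifier'` key (as in the loop above). The helper name `first_identifiers` and the sample list are hypothetical and only for illustration; nothing here contacts the IA.

```python
from itertools import islice

def first_identifiers(results, limit):
    """Return the identifiers of at most `limit` results, in order."""
    # islice stops pulling from `results` once `limit` items have been taken,
    # so a lazy search is never iterated past the items actually needed.
    return [result['identifier'] for result in islice(results, limit)]

# Stand-in data shaped like the search results printed earlier;
# in the notebook, the `search` object would be passed in instead.
sample_results = [
    {'identifier': '100conventionsma00mays'},
    {'identifier': '16thnationalanti00chap'},
    {'identifier': '39999081179663Images'},
    {'identifier': 'abolitionismrevi00lewi'},
]

for itemid in first_identifiers(sample_results, 3):
    print(itemid)
```

Each identifier returned this way could then be handed to `internetarchive.get_item` exactly as in the download loop, so only a chosen number of MARC records is fetched.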

# Parsing Information from MARC.xml Files 

In [24]:
# Parsing info about the location of these publications from the MARC records using pymarc
import pymarc

def get_place_of_pub(record):
    place_of_pub = record['260']['a']
    print(place_of_pub)

pymarc.map_xml(get_place_of_pub, 'lettertowilliaml00doug_marc.xml')

Belfast, [Northern Ireland],
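
For readers wondering what `record['260']['a']` refers to: in MARCXML, each field is a `datafield` element with a `tag` attribute, and each subfield inside it is a `subfield` element with a `code` attribute. The sketch below uses Python's standard library instead of pymarc, purely to make that structure visible. It parses a tiny hand-written record, not the actual file downloaded from the IA, whose wrapper elements and other fields may differ. The function name `places_of_publication` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The MARCXML namespace; every element in a MARCXML file lives in it.
MARC_NS = {'marc': 'http://www.loc.gov/MARC21/slim'}

# A hand-written, minimal stand-in for a *_marc.xml file (assumption:
# real records contain many more fields than this single 260 field).
SAMPLE_MARCXML = """
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">Belfast, [Northern Ireland],</subfield>
    </datafield>
  </record>
</collection>
"""

def places_of_publication(xml_text):
    """Return the text of every 260$a subfield found in the XML string."""
    root = ET.fromstring(xml_text)
    # Field 260, subfield a: the place of publication.
    xpath = ".//marc:datafield[@tag='260']/marc:subfield[@code='a']"
    return [subfield.text for subfield in root.iterfind(xpath, MARC_NS)]

print(places_of_publication(SAMPLE_MARCXML))
```

pymarc remains the more convenient choice for real work, since it handles the namespace and record boundaries for you.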


In [29]:
# Improving our location gathering script

import os
import pymarc

# Setting path to where our MARC records are located
path = '/Users/vivia/Downloads/cls161/cls161_fall23/lesson_1_files/'

def get_place_of_pub(record):
    try:
        place_of_pub = record['260']['a']
        print(place_of_pub)
    
    except Exception as e:
        print(e)

for file in os.listdir(path):
    if file.endswith('.xml'):
        pymarc.map_xml(get_place_of_pub, os.path.join(path, file))

[S.l.],

[New Orleans] :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
[Boston :
Belfast, [Northern Ireland],
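
The brackets, colons, and commas in this output are cataloging punctuation stored inside the 260$a subfield itself, so they can be removed with ordinary string methods once the value has been read. The following is a minimal sketch, assuming that dropping all square brackets and then trimming spaces, colons, commas, and semicolons from both ends is an acceptable cleanup. The helper name `clean_place` is hypothetical, and `raw_places` simply repeats the strings printed above.

```python
from collections import Counter

def clean_place(place):
    """Remove square brackets and leading/trailing cataloging punctuation."""
    # Delete every '[' and ']' wherever it appears in the string.
    without_brackets = place.translate(str.maketrans('', '', '[]'))
    # Trim spaces, colons, commas, and semicolons from both ends only.
    return without_brackets.strip(' :,;')

# The values printed by the loop above, copied by hand.
raw_places = (
    ['[S.l.],', '[New Orleans] :']
    + ['[Boston :'] * 15
    + ['Belfast, [Northern Ireland],']
)

# Tally how often each cleaned place name occurs.
counts = Counter(clean_place(place) for place in raw_places)
for place, count in counts.most_common():
    print(place, count)
```

In the notebook, `get_place_of_pub` could append `clean_place(place_of_pub)` to a list instead of printing it, and the tally could then be built from that list. Note that "S.l." (sine loco) is the cataloger's way of recording an unknown place, so it may be worth filtering out before counting.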
