# Skimming LCSH data from Gale files

The goal of this notebook is to figure out a quick way to skim the metadata of the thousands of ECCO and NCCO files from Gale-Cengage, check whether each file has a relevant Library of Congress Subject Heading (LCSH), and, if it does, add the file to a list.

The terms I am looking for include `'travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit'`. These are the same terms that I used to search the HTRC.

In a past notebook, I struggled with XML, in part because it was very slow to open an XML file, read it in, and then check whether what I wanted was there. This [lxml method from IBM](https://www.ibm.com/developerworks/xml/library/x-hiperfparse/), however, helped me develop something quicker.

In [1]:
from lxml import etree
import pandas as pd
import glob
from bs4 import BeautifulSoup

In [2]:
import lxml

In [3]:
context = etree.iterparse('files/sampleECCOtxt.xml', events=('end',), tag='locSubject')

loclist = []
for event, elem in context:
    print(elem.text)
    loclist.append(elem.text)
# Let's just try printing these in a few different ways

Westminster (London, England)
History
Early works to 1800
London (England)
History
Early works to 1800


In [4]:
string = '; '.join(loclist)
string

'Westminster (London, England); History; Early works to 1800; London (England); History; Early works to 1800'

In [5]:
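for event, elem in context:
    print(elem.text)
    # Nothing prints here: `context` is an iterator, and the loop in cell 3
    # already consumed it. A new etree.iterparse(...) call is needed to re-read the file.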
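
A likely explanation for the empty output: `etree.iterparse` returns an iterator, and the loop in cell 3 already ran it to the end, so a second loop over the same `context` has nothing left to yield. A minimal sketch, using a small in-memory document (the XML below is made up for illustration, not taken from the Gale files):

```python
import io
from lxml import etree

# Hypothetical stand-in for a Gale metadata file
sample = b"<book><locSubject>History</locSubject><locSubject>London (England)</locSubject></book>"

context = etree.iterparse(io.BytesIO(sample), events=('end',), tag='locSubject')

first_pass = [elem.text for event, elem in context]
second_pass = [elem.text for event, elem in context]  # iterator is already used up

# Building a new iterparse object starts again from the top of the document
fresh = etree.iterparse(io.BytesIO(sample), events=('end',), tag='locSubject')
third_pass = [elem.text for event, elem in fresh]

print(first_pass, second_pass, third_pass)
```

Since the values were already saved in `loclist`, re-parsing is only needed if the file itself has to be read again.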

In [6]:
loclist

['Westminster (London, England)',
 'History',
 'Early works to 1800',
 'London (England)',
 'History',
 'Early works to 1800']

I also should not have to worry if the tag does not exist: as can be seen below, it won't break or throw an error; the result will just be empty. So I should still be able to apply my general method of something like `if word in termlist`.

In [7]:
contextnone = etree.iterparse('files/sampleECCOtxt.xml', events=('end',), tag='holdings2')

In [8]:
emptylist = []
for event, elem in contextnone:
    emptylist.append(elem.text)
emptylist

[]

### Running over multiple files
 
Let's try running this over a few files to see what happens. The main chunk of code below originally used BeautifulSoup, which took way too long; hopefully this will move a bit more quickly!

I will also need a list of metadata tags to iterate through, in order to grab the metadata for the files that I want. Let's do that first.

I also have to remember that although this chunk of code should work for ECCO files, I may have to adjust it slightly to work with NCCO files, especially since XML tag names are case-sensitive (or at least, they are in the way I match them above).
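
One way to cope with differing capitalisation is to skip the `tag=` filter and compare lowercased tag names by hand. The sketch below assumes, purely for illustration, that an NCCO file might spell the element `LocSubject`; the real NCCO tag names would need to be checked against the files. The helper name `texts_for_tag` is also mine.

```python
import io
from lxml import etree

def texts_for_tag(source, wanted):
    """Collect text from elements whose tag matches `wanted`, ignoring case."""
    wanted = wanted.lower()
    found = []
    # no tag= filter here, so every element's end event is visited
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag.lower() == wanted:
            found.append(elem.text)
    return found

# Hypothetical snippets: the capitalised spelling is an assumption, not a known NCCO tag
ecco_like = b"<book><locSubject>History</locSubject></book>"
ncco_like = b"<book><LocSubject>History</LocSubject></book>"

print(texts_for_tag(io.BytesIO(ecco_like), 'locSubject'))
print(texts_for_tag(io.BytesIO(ncco_like), 'locSubject'))
```

Visiting every element is slower than letting lxml filter by tag, so this is probably only worth it if the two collections really do differ in capitalisation.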

In [9]:
taglist = ['documentID', 'ESTCID', 'pubDate',
           'language','module','locSubject','notes',
           'fullTitle','displayTitle','currentVolume', 
           'totalVolumes', 'imprintPublisher','imprintFull',
           'imprintCity', 'publicationPlace']

In [11]:
# set up a dict to hold all the strings of all these elements
elementdict = {}

for xmltag in taglist:
    elementlist = []
    context = etree.iterparse('files/sampleECCOtxt.xml', events=('end',), tag=xmltag)
    for event, elem in context:
        elementlist.append(elem.text)
    elementdict[xmltag] = ', '.join(elementlist)
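
One thing to watch for: an element that exists but is empty has `elem.text` equal to `None`, and `', '.join()` raises a `TypeError` if the list contains `None`. A hedged sketch of a small helper that skips empty elements (the name `collect_text` and the sample XML are mine, not from the Gale data):

```python
import io
from lxml import etree

def collect_text(source, xmltag):
    """Join the text of every <xmltag> element, skipping empty ones."""
    texts = []
    for event, elem in etree.iterparse(source, events=('end',), tag=xmltag):
        if elem.text is not None:
            texts.append(elem.text.strip())
    return ', '.join(texts)

# Made-up record with one empty <notes/> element
sample = b"<book><notes/><notes>With an index.</notes><module>History and Geography</module></book>"

print(collect_text(io.BytesIO(sample), 'notes'))
print(collect_text(io.BytesIO(sample), 'holdings2'))  # missing tag gives an empty string
```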

    

In [12]:
#aaaand, let's see if it worked
elementdict

{'ESTCID': 'T228085',
 'currentVolume': 'Volume 2',
 'displayTitle': 'A new and compleat survey of London. In ten parts. I. All the publick transactions and memorable events, that have happened to the citizens, from ...',
 'documentID': '1299400102',
 'fullTitle': 'A new and compleat survey of London. In ten parts. I. All the publick transactions and memorable events, that have happened to the citizens, from its foundation, to the year 1742. II. A particular description of the thirteen wards on the East of Walbrook. III. Of the twelve wards on the West of Walbrook. IV. A political account of London; parallels between this and the most celebrated cities of antiquity, as well as the modern great cities of Europe, Asia and Africa. V. An historical account of the city governments, ecclesiastical, civil and military. VI. A full account of the great and extensive commerce of the city; and of the several incorporations of the arts and mysteries of the citizens. VII. Of the present state of le

Great! Now let's set up a file list with two files, one with a travel heading and one without, and see if we can get that working...

In [13]:
twofiles = ['files/sampleECCOtextTravel.xml', 'files/sampleECCOtxt.xml']

In [14]:
# first, we only want the files that have certain LCSH
# so let's set up a list of search terms
termlist = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']

teststring = 'Italy, Description and travel, Early works to 1800'


In [15]:
# just to make sure that my if statement will work:
if any(x in teststring for x in termlist):
    print('yes')
else:
    print('no')

yes
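
Note that `in` is case-sensitive, so a heading that begins with a capitalised term (for example `'Discoveries in geography'`) would be missed by the check above. A small sketch that lowercases the heading first; the function name `has_relevant_term` and the example headings are illustrative, not drawn from the Gale files:

```python
termlist = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']

def has_relevant_term(heading, terms=termlist):
    """Return True if any search term occurs in the heading, ignoring case."""
    lowered = heading.lower()
    return any(term in lowered for term in terms)

print(has_relevant_term('Italy, Description and travel, Early works to 1800'))
print(has_relevant_term('Discoveries in geography'))
print(has_relevant_term('London (England), History'))
```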


In [17]:
# this list will hold one dict of metadata
# for each relevant file
listofdicts = []

for file in twofiles:
    testparse = etree.iterparse(file, events=('end',), tag='locSubject')
    testlist = []
    for event, elem in testparse:
        testlist.append(elem.text)
        # reclaim memory as we go: clear each element once its text
        # has been read, then drop the siblings already processed
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        
    teststring = ', '.join(testlist) 
    print(teststring)
        
    if any(lcsh in teststring for lcsh in termlist):
        print('yes')
        
        # if that is true, we want the metadata
        # so let's make a dict to hold it
        # this dict will be reset with every file loop
        filedict = {}
        
        for xmltag in taglist:
            # make an empty list to hold what is in each tag, 
            # which will be written to our dict in a few steps
            elementlist = []
            
            context = etree.iterparse(file, events=('end',), tag=xmltag)
            for event, elem in context:
                elementlist.append(elem.text)
                # reclaim memory as we go: clear each element once its
                # text has been read, then drop the siblings already processed
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # assign to the dictionary
            filedict[xmltag] = ', '.join(elementlist)
            
        # after it has looped through all the xmltags, 
        # add the filedict to list of dicts 
        listofdicts.append(filedict)
    else:
        print('no')
            

Italy, Description and travel, Early works to 1800
yes
Westminster (London, England), History, Early works to 1800, London (England), History, Early works to 1800
no
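
The loop above re-opens and re-parses each file once per tag, plus once more for `locSubject`. A possible alternative, sketched here under the assumption that every wanted tag can be collected in a single pass, is to iterate over all elements once and keep only the tags of interest. The function name `extract_metadata`, the shortened tag list, and the sample XML are illustrative only.

```python
import io
from collections import defaultdict
from lxml import etree

def extract_metadata(source, tags):
    """Parse `source` once and return {tag: 'text, text, ...'} for the wanted tags."""
    wanted = set(tags)
    collected = defaultdict(list)
    for event, elem in etree.iterparse(source, events=('end',)):
        # keep only wanted tags that actually contain text
        if elem.tag in wanted and elem.text:
            collected[elem.tag].append(elem.text)
    # tags that never appeared still get an (empty) entry
    return {tag: ', '.join(collected[tag]) for tag in tags}

# Made-up record; the structure of real Gale files may differ
sample = (b"<book><documentID>0084800600</documentID>"
          b"<locSubject>Italy</locSubject>"
          b"<locSubject>Description and travel</locSubject></book>")

record = extract_metadata(io.BytesIO(sample), ['documentID', 'locSubject', 'notes'])
print(record)
```

The LCSH check can then run on `record['locSubject']` without a second parse. This sketch does not clear elements as it goes, so memory use grows with file size.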


In [18]:
listofdicts

[{'ESTCID': 'T110070',
  'currentVolume': '0',
  'displayTitle': "The travels of the learned Father Montfaucon from Paris thro' Italy. Containing I. An Account of many Antiquities at Vienne, Arles, Nismes, and ...",
  'documentID': '0084800600',
  'fullTitle': "The travels of the learned Father Montfaucon from Paris thro' Italy. Containing I. An Account of many Antiquities at Vienne, Arles, Nismes, and Marseilles in France. II. The Delights of Italy, viz. Libraries, Manuscripts, Statues, Paintings, Monuments, Tombs, Inscriptions, Epitaphs, Temples, Monasteries, Churches, Palaces, and other Curious Structures. III. Collections of Rarities, wonderful Subterraneous Passages and Burial-Places, old Roads, Gates, &c. with the Description of a Noble Monument found under Ground at Rome Anno M.DCC.II. Made English from the Paris edition. Adorn'd with Cuts.",
  'imprintCity': 'London',
  'imprintFull': "London : printed by D. L. for E. Curll at the Dial and Bible against St. Dunstan's-Church, E.

In [19]:
dictsdf = pd.DataFrame(listofdicts)

In [20]:
dictsdf

| | ESTCID | currentVolume | displayTitle | documentID | fullTitle | imprintCity | imprintFull | imprintPublisher | language | locSubject | module | notes | pubDate | publicationPlace | totalVolumes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T110070 | 0 | The travels of the learned Father Montfaucon f... | 84800600 | The travels of the learned Father Montfaucon f... | London | London : printed by D. L. for E. Curll at the ... | printed by D. L. for E. Curll at the Dial and ... | English | Italy, Description and travel, Early works to ... | History and Geography | With an index. In this issue the imprint date... | 17120101 | London | 0 |
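
One detail visible in the table: `documentID` appears as `84800600`, whereas the dict above holds the string `'0084800600'`. That pattern usually means the column was read back as an integer somewhere along the way, for example after a round trip through CSV. A sketch of how the leading zeros could be kept, assuming the DataFrame is saved and reloaded with pandas (the in-memory buffer stands in for a real file):

```python
import io
import pandas as pd

df = pd.DataFrame([{'documentID': '0084800600', 'ESTCID': 'T110070'}])

# Write to an in-memory CSV instead of a real file
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default parsing infers an integer column and drops the leading zeros
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Forcing the column to str keeps the ID exactly as written
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'documentID': str})

print(as_default['documentID'][0], as_text['documentID'][0])
```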
