# Skimming LCSH data from Gale files

The goal of this notebook is to figure out a quick way to skim the metadata of the thousands of ECCO and NCCO files from Gale-Cengage, discover whether each file has a relevant Library of Congress Subject Heading (LCSH), and, if it does, add it to a list.

The terms I am looking for include `'travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit'`. These are the same terms that I used to search the HTRC.

In a past notebook, I struggled with using xml, in part because it was very slow to open an xml file, read the whole thing in to BeautifulSoup, and then see if what I wanted was there. This [lxml method from IBM](https://www.ibm.com/developerworks/xml/library/x-hiperfparse/), however, was helpful in developing something quicker.

In [1]:
from lxml import etree
import pandas as pd
import glob

In [4]:
%%time 

context = etree.iterparse('files/sampleECCOtxt.xml', events=('end',), tag='locSubject')

loclist = []
for event, elem in context:
    loclist.append(elem.text)
# Let's just try printing these in a few different ways

Wall time: 866 ms


In [4]:
string = '; '.join(loclist)
string

'Westminster (London, England); History; Early works to 1800; London (England); History; Early works to 1800'

In [5]:
for event, elem in context:
    print(elem.text)
    # Nothing prints here: `context` is an iterator, and the loop
    # above already used it up, so there is nothing left to loop over
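
The empty output from the cell above can be reproduced on a small scale. This is a minimal sketch, assuming an invented in-memory sample instead of an ECCO file, and using the standard library's `xml.etree.ElementTree` so that it runs without extra installs. lxml's `iterparse` is likewise an iterator that is used up by its first loop.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample data standing in for an ECCO file.
SAMPLE = b"<book><locSubject>History</locSubject><locSubject>London (England)</locSubject></book>"

def subjects(context):
    """Collect the text of every locSubject element the iterator yields."""
    return [elem.text for _, elem in context if elem.tag == "locSubject"]

context = ET.iterparse(io.BytesIO(SAMPLE), events=("end",))
first_pass = subjects(context)    # consumes the iterator
second_pass = subjects(context)   # nothing left to yield

# Building a fresh iterator gives the results again.
fresh_pass = subjects(ET.iterparse(io.BytesIO(SAMPLE), events=("end",)))
print(first_pass, second_pass, fresh_pass)
```

So the options are either to keep the results in a list (as `loclist` does) or to call `iterparse` again before each new loop.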

In [6]:
loclist

['Westminster (London, England)',
 'History',
 'Early works to 1800',
 'London (England)',
 'History',
 'Early works to 1800']

I should also not have to worry if the tag does not exist; as can be seen below, it won't break or throw an error - the result will just be empty. So, I should still be able to apply my general method of something like `if word is in termlist`.

In [7]:
contextnone = etree.iterparse('files/sampleECCOtxt.xml', events=('end',), tag='holdings2')

In [8]:
emptylist = []
for event, elem in contextnone:
    emptylist.append(elem.text)
emptylist

[]

### Running over multiple files
 
Let's try running this over a few files to see what happens. The main chunk of code below was originally running BeautifulSoup, which took way too long; hopefully this will move a bit more quickly!

I will also need a list of metadata tags to iterate through, in order to grab the metadata for the files that I want. Let's do that first.

I also have to remember that although this chunk of code should work for ECCO files, I may have to adjust it slightly to work with NCCO files, especially since lxml tags are sensitive to capital letters (or at least, they are as they are used above).

In [11]:
taglist = ['documentID', 'ESTCID', 'pubDate',
           'language','module','locSubject','notes',
           'fullTitle','displayTitle','currentVolume', 
           'totalVolumes', 'imprintPublisher','imprintFull',
           'imprintCity', 'publicationPlace']

In [11]:
# set up a dict to hold all the strings of all these elements
elementdict = {}

for xmltag in taglist:
    elementlist = []
    context = etree.iterparse('files/sampleECCOtxt.xml', events=('end',), tag=xmltag)
    for event, elem in context:
        elementlist.append(elem.text)
    elementdict[xmltag] = ', '.join(elementlist)

    

In [12]:
#aaaand, let's see if it worked
elementdict

{'ESTCID': 'T228085',
 'currentVolume': 'Volume 2',
 'displayTitle': 'A new and compleat survey of London. In ten parts. I. All the publick transactions and memorable events, that have happened to the citizens, from ...',
 'documentID': '1299400102',
 'fullTitle': 'A new and compleat survey of London. In ten parts. I. All the publick transactions and memorable events, that have happened to the citizens, from its foundation, to the year 1742. II. A particular description of the thirteen wards on the East of Walbrook. III. Of the twelve wards on the West of Walbrook. IV. A political account of London; parallels between this and the most celebrated cities of antiquity, as well as the modern great cities of Europe, Asia and Africa. V. An historical account of the city governments, ecclesiastical, civil and military. VI. A full account of the great and extensive commerce of the city; and of the several incorporations of the arts and mysteries of the citizens. VII. Of the present state of le

Great! And if we set up a file list and try to do two files, one with travel, one without, let's see if we can get that working...

In [7]:
twofiles = ['files/sampleECCOtextTravel.xml', 'files/sampleECCOtxt.xml']

In [9]:
# first, we only want the files that have certain LCSH
# so let's set up a list of tags
termlist = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']

teststring = 'Italy, Description and travel, Early works to 1800'


In [15]:
# just to make sure that my if statement will work:
if any(x in teststring for x in termlist):
    print('yes')
else:
    print('no')

yes
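
One caveat with the check above: `in` is case-sensitive, so `'voyage'` does not match `'Voyages and travels'` and `'explor'` does not match `'Explorers'`. The sketch below lowercases the joined headings before testing them. The helper name `has_relevant_heading` is hypothetical, and skipping empty (`None`) headings is an added assumption.

```python
termlist = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']

def has_relevant_heading(subjects, terms=termlist):
    """Return True if any term occurs in the joined headings, ignoring case."""
    # Skip empty headings (elem.text is None for an empty element).
    joined = ', '.join(s for s in subjects if s).lower()
    return any(term in joined for term in terms)

print(has_relevant_heading(['Voyages and travels']))
print(has_relevant_heading(['Explorers', 'Portugal']))
print(has_relevant_heading(['History', 'London (England)']))
```

The first two calls match only because of the `.lower()` step (plus `'travel'` in the first case); the third has no relevant term at all.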


## _note_ Run the next cell before the ones below it
Make sure to establish the `taglist` and `termlist`.

In [2]:
import pandas as pd
import glob
from lxml import etree

taglist = ['documentID', 'ESTCID', 'pubDate',
           'language','module','locSubject','notes',
           'fullTitle','displayTitle','currentVolume', 
           'totalVolumes', 'imprintPublisher','imprintFull',
           'imprintCity', 'publicationPlace']
termlist = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']

The next three cells have variations of:
1. two files, one with travel and one without: `8.82 s ± 2.84 s per loop (mean ± std. dev. of 10 runs, 1 loop each)`
2. one file, with travel: `7.18 s ± 751 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)`
3. one file, without travel: `2.78 s ± 334 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)`

So, in general, if I have 3000 files and 300 of them are travel related (as was roughly the case below), it would take 9,720 s (162 minutes / 2.7 hours).

Is there a way to do this more quickly? At this rate, since there are about 200,000 files on ECCO I and ECCO II, it will take 648,000 seconds, or 180 hours. Ouch. 

Of course, some of these sections, like medicine or Lit&Lang, will have far fewer travel texts, so that number of hours could drop. For example, if only 2% (rather than 10%) of all 200,000 files have a travel tag, the time then becomes 577,600 seconds, or 160ish hours. 
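
The arithmetic behind these estimates can be written out as a small function. This is a sketch that assumes the per-file timings measured on the two sample files (rounded to 7.2 s and 2.8 s) hold for every file, which is a rough guess at best; the function name `estimate_seconds` is hypothetical.

```python
TRAVEL_SECONDS = 7.2   # rounded from 7.18 s (file with a travel heading)
OTHER_SECONDS = 2.8    # rounded from 2.78 s (file without one)

def estimate_seconds(n_files, travel_share):
    """Estimated run time if `travel_share` of the files are travel related."""
    travel = n_files * travel_share
    other = n_files - travel
    return travel * TRAVEL_SECONDS + other * OTHER_SECONDS

for n_files, share in [(3000, 0.10), (200_000, 0.10), (200_000, 0.02)]:
    seconds = estimate_seconds(n_files, share)
    print(f"{n_files:>7} files, {share:.0%} travel: "
          f"{seconds:,.0f} s ({seconds / 3600:.1f} h)")
```

Because most files fall into the cheaper "no travel heading" group, the share of travel files changes the total much less than the per-file cost of the first check does.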

These files are on my local hard drive, and not on the Gale hard drive, so I am guessing these loops might go a bit faster than when I'm using the Gale hard drive? Google did not provide an easy answer to this.

In [6]:
%%timeit -r 10

# this list will hold one dict of metadata
# for each relevant file
listofdicts = []

twofiles = ['files/sampleECCOtextTravel.xml', 'files/sampleECCOtxt.xml']

for file in twofiles:
    testparse = etree.iterparse(file, events=('end',), tag = 'locSubject')
    testlist = []
    for event, elem in testparse:
        testlist.append(elem.text)
        # reclaim the memory as each element is read -
        # clears unneeded node references
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        
    teststring = ', '.join(testlist) 
    # print(teststring)
        
    if any(lcsh in teststring for lcsh in termlist):
        # print('yes')
        
        # if that is true, we want the metadata
        # so let's make a dict to hold it
        # this dict will be reset with every file loop
        filedict = {}
        
        for xmltag in taglist:
            # make an empty list to hold what is in each tag, 
            # which will be written to our dict in a few steps
            elementlist = []
            
            context = etree.iterparse(file, events=('end',), tag=xmltag)
            for event, elem in context:
                elementlist.append(elem.text)
                # the below should make things faster - I think? 
                # reclaim the memory as each element is read -
                # clears unneeded node references
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # assign to the dictionary
            filedict[xmltag] = ', '.join(elementlist)
            
        # after it has looped through all the xmltags, 
        # add the filedict to list of dicts 
        listofdicts.append(filedict)
    #else:
        # print('no')

8.82 s ± 2.84 s per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [7]:
%%timeit -r 10

# this list will hold one dict of metadata
# for each relevant file
listofdicts = []

travellist = ['files/sampleECCOtextTravel.xml']

for file in travellist:
    testparse = etree.iterparse(file, events=('end',), tag = 'locSubject')
    testlist = []
    for event, elem in testparse:
        testlist.append(elem.text)
        # reclaim the memory as each element is read -
        # clears unneeded node references
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        
    teststring = ', '.join(testlist) 
    # print(teststring)
        
    if any(lcsh in teststring for lcsh in termlist):
        # print('yes')
        
        # if that is true, we want the metadata
        # so let's make a dict to hold it
        # this dict will be reset with every file loop
        filedict = {}
        
        for xmltag in taglist:
            # make an empty list to hold what is in each tag, 
            # which will be written to our dict in a few steps
            elementlist = []
            
            context = etree.iterparse(file, events=('end',), tag=xmltag)
            for event, elem in context:
                elementlist.append(elem.text)
                # the below should make things faster - I think? 
                # reclaim the memory as each element is read -
                # clears unneeded node references
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # assign to the dictionary
            filedict[xmltag] = ', '.join(elementlist)
            
        # after it has looped through all the xmltags, 
        # add the filedict to list of dicts 
        listofdicts.append(filedict)
    #else:
       # print('no')
            

7.18 s ± 751 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [8]:
%%timeit -r 10

# this list will hold one dict of metadata
# for each relevant file
listofdicts = []

nontravellist = ['files/sampleECCOtxt.xml']

for file in nontravellist:
    testparse = etree.iterparse(file, events=('end',), tag = 'locSubject')
    testlist = []
    for event, elem in testparse:
        testlist.append(elem.text)
        # reclaim the memory as each element is read -
        # clears unneeded node references
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        
    teststring = ', '.join(testlist) 
    # print(teststring)
        
    if any(lcsh in teststring for lcsh in termlist):
        #print('yes')
        
        # if that is true, we want the metadata
        # so let's make a dict to hold it
        # this dict will be reset with every file loop
        filedict = {}
        
        for xmltag in taglist:
            # make an empty list to hold what is in each tag, 
            # which will be written to our dict in a few steps
            elementlist = []
            
            context = etree.iterparse(file, events=('end',), tag=xmltag)
            for event, elem in context:
                elementlist.append(elem.text)
                # the below should make things faster - I think? 
                # reclaim the memory as each element is read -
                # clears unneeded node references
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # assign to the dictionary
            filedict[xmltag] = ', '.join(elementlist)
            
        # after it has looped through all the xmltags, 
        # add the filedict to list of dicts 
        listofdicts.append(filedict)
    #else:
        #print('no')
            

2.78 s ± 334 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [18]:
listofdicts

[]

In [19]:
dictsdf = pd.DataFrame(listofdicts)

In [20]:
dictsdf

Unnamed: 0,ESTCID,currentVolume,displayTitle,documentID,fullTitle,imprintCity,imprintFull,imprintPublisher,language,locSubject,module,notes,pubDate,publicationPlace,totalVolumes
0,T110070,0,The travels of the learned Father Montfaucon f...,84800600,The travels of the learned Father Montfaucon f...,London,London : printed by D. L. for E. Curll at the ...,printed by D. L. for E. Curll at the Dial and ...,English,"Italy, Description and travel, Early works to ...",History and Geography,With an index. In this issue the imprint date...,17120101,London,0


## Applying it to all the files

Now that I have this code working, I should be able to apply it to all the other files.

I've asked an astute colleague (thanks Jonathan!) to look over the above, in case there is some error that would cause something to cascade and run forever (or, just run really slow). That's most likely to happen because of my unfamiliarity with lxml. Pandas, I think, should be able to handle what I'm asking of it.

So now, step one of this new task: get a file list. 

We'll start with the HistAndGeo section of ECCOII, since it has only 3384 files as opposed to the ~14000 files of ECCOI (ECCO was released in two segments). I'm also fairly hopeful that there should be at least one or two pieces of travel writing in there, since it is the ECCO genre most closely related to travel writing.

In [21]:
filelist = glob.glob('D:/ECCOII 2001/HistAndGeo/XML/*.xml')

In [22]:
len(filelist)

3384

In [26]:
shortfilelist = filelist[:100]
shortfilelist[:5]

['D:/ECCOII 2001/HistAndGeo/XML\\1299100101.xml',
 'D:/ECCOII 2001/HistAndGeo/XML\\1299100102.xml',
 'D:/ECCOII 2001/HistAndGeo/XML\\1299100103.xml',
 'D:/ECCOII 2001/HistAndGeo/XML\\1299100200.xml',
 'D:/ECCOII 2001/HistAndGeo/XML\\1299100301.xml']

Okay, now I need to replicate the code up above so that it will work on this larger batch.

I'm thinking, as well, of things that I will have to watch for: where will these files overlap with the ones in mybib already? I will have to be careful when integrating and comparing my various points of data, but the volume information and the file id numbering system (ie, having 0s vs 1/2/etc on the end of the filename) should help!
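
Judging only from the sample records shown in this notebook (for example, `1309900301` is 'Volume 1', `1309700116` is 'Volume 16', and `1309700600` has no volume), the last two digits of a `documentID` appear to carry the volume number. A minimal sketch of splitting an ID on that assumption follows; the helper name `split_document_id` is hypothetical, and the pattern has not been checked against the whole collection.

```python
def split_document_id(document_id):
    """Split a documentID string into (work prefix, volume number).

    Assumes the final two digits are the volume, with 00 meaning
    a single-volume work.
    """
    return document_id[:-2], int(document_id[-2:])

for doc_id in ['1309900301', '1309900302', '1309700600']:
    print(doc_id, split_document_id(doc_id))
```

Grouping on the prefix would then put the volumes of one work side by side when comparing against an existing bibliography.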

So, let's modify the code that I used earlier.

In particular, I want to add a line that will note which ECCO segment it came from - ECCOI or ECCOII.

In [27]:
# this list will hold one dict of metadata
# for each relevant file
listofdicts = []

for file in shortfilelist:
    
    # the first iterparse will test to see if it has the desired lcsh.
    testparse = etree.iterparse(file, events=('end',), tag = 'locSubject')
    testlist = []
    for event, elem in testparse:
        testlist.append(elem.text)
        # the below will reclaim the memory as each element is read -
        # clears unneeded node references
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        
    # and back to the purpose of our code - 
    # note that putting it in a string makes it easier to search
    # comparing list items required an exact match,
    # and I wanted fuzzier searching, 
    # just in case there were any errors in the controlled vocabulary
    # of the lcsh.
    teststring = ', '.join(testlist)     
    if any(lcsh in teststring for lcsh in termlist):
        
        # if that is true, we want the metadata for that file
        # so let's make a dict to hold it
        # this dict will be reset with every file loop
        filedict = {}
        
        # a dict entry to indicate which ECCO release it came from
        filedict['eccorelease'] = '2'
        
        for xmltag in taglist:
            # make an empty list to hold what is in each tag, 
            # which will be written to our dict in a few steps
            elementlist = []
            
            context = etree.iterparse(file, events=('end',), tag=xmltag)
            for event, elem in context:
                elementlist.append(elem.text)
                # the below should make things faster - I think? 
                # reclaim the memory as each element is read -
                # clears unneeded node references
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # assign to the dictionary
            filedict[xmltag] = ', '.join(elementlist)
            
        # after it has looped through all the xmltags, 
        # add the filedict to list of dicts 
        listofdicts.append(filedict)            

In [28]:
dictsdf = pd.DataFrame(listofdicts)

In [29]:
dictsdf

Unnamed: 0,ESTCID,currentVolume,displayTitle,documentID,fullTitle,imprintCity,imprintFull,imprintPublisher,language,locSubject,module,notes,pubDate,publicationPlace,totalVolumes
0,N025387,Volume 16,"The World displayed; Or, A curious collection ...",1309700116,"The World displayed; Or, A curious collection ...",London,"London : Printed for J. Newbery, at the Bible ...","Printed for J. Newbery, at the Bible and Sun, ...",English,Voyages and travels,History and Geography,,17670101,London,1
1,N031356,0,An essay towards a description of the city of ...,1309700600,An essay towards a description of the city of ...,[Bath],"[Bath] : Printed for W. Frederick, bookseller,...","Printed for W. Frederick, bookseller, in Bath",English,"Bath (England), Description and travel, Early ...",History and Geography,With an additional titlepage for pt. 1: 'An es...,17420101,Bath,0
2,N029227,0,"Interesting account of the early voyages, made...",1309900200,"Interesting account of the early voyages, made...",London,"London : Printed for the proprietors, and sold...","Printed for the proprietors, and sold at Stalk...",English,"Explorers, Portugal, Early works to 1800, Expl...",History and Geography,,17900101,London,0
3,N031489,Volume 1,An entertaining journey to the Netherlands; co...,1309900301,An entertaining journey to the Netherlands; co...,London,"London : printed for W. Smith, M DCC LXXXII. [...",printed for W. Smith,English,"Netherlands, Description and travel, Early wor...",History and Geography,Coriat Junior = Samuel Paterson?.,17820101,London,3
4,N031489,Volume 2,An entertaining journey to the Netherlands; co...,1309900302,An entertaining journey to the Netherlands; co...,London,"London : printed for W. Smith, M DCC LXXXII. [...",printed for W. Smith,English,"Netherlands, Description and travel, Early wor...",History and Geography,Coriat Junior = Samuel Paterson?.,17820101,London,3
5,N031489,Volume 3,An entertaining journey to the Netherlands; co...,1309900303,An entertaining journey to the Netherlands; co...,London,"London : printed for W. Smith, M DCC LXXXII. [...",printed for W. Smith,English,"Netherlands, Description and travel, Early wor...",History and Geography,Coriat Junior = Samuel Paterson?.,17820101,London,3
6,T188570,0,"A New collection of voyages and travels, dedic...",1309901100,"A New collection of voyages and travels, dedic...",London,"London : Printed for E. Newbery, the corner of...","Printed for E. Newbery, the corner of St. Paul...",English,"Voyages and travels, Early works to 1800",History and Geography,William Mavor is the editor of the 'Historical...,17960101,London,0
7,T172346,0,"Miscellaneous remarks made on the spot, in a l...",1310300400,"Miscellaneous remarks made on the spot, in a l...",London,"London : Printed for S. Hooper, at Gay's Head,...","Printed for S. Hooper, at Gay's Head, near Bea...",English,"Italy, Description and travel, Early works to ...",History and Geography,,17560101,London,0


Hurrah, it worked! Now, to replicate the code and do it for the multiple sections of ECCO II. 

# ECCO Part 2
Let's run our analysis on ECCO pt 2 (there are fewer files here than on ECCO pt 1!)

In [1]:
from lxml import etree
import pandas as pd
import glob

Because many of the files are 

In [52]:
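import os

# top-level ECCO II folder
# (value inferred from the paths printed below)
mypath = 'D:/ECCOII 2001/'

filelist = []
for root, dirs, files in os.walk(mypath):
    for file in files:
        if file.endswith(".xml"):
            filelist.append(os.path.join(root, file))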

In [53]:
len(filelist)

52690

In [65]:
filelist[:10]

['D:/ECCOII 2001/GenRef\\XML\\1336600100.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600200.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600300.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600400.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600500.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600600.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600700.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600800.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336600900.xml',
 'D:/ECCOII 2001/GenRef\\XML\\1336601000.xml']

Looks good - there shouldn't be any subfolders or word docs, for example. 
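
Rather than eyeballing the first ten paths, the folder contents can be tallied by file extension. This is a minimal sketch that runs on a throwaway temporary folder with invented file names; the helper name `count_extensions` is hypothetical, and pointing it at the real ECCO folder is left as an untested assumption.

```python
import os
import tempfile
from collections import Counter

def count_extensions(top):
    """Count the files under `top` by lowercased extension."""
    counts = Counter()
    for root, dirs, files in os.walk(top):
        for name in files:
            counts[os.path.splitext(name)[1].lower()] += 1
    return counts

# Demonstration on a throwaway folder with invented file names.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'GenRef', 'XML'))
    for name in ['0000000100.xml', '0000000200.xml', 'readme.doc']:
        open(os.path.join(top, 'GenRef', 'XML', name), 'w').close()
    counts = count_extensions(top)

print(counts)
```

Anything other than `.xml` in the tally would show up straight away as its own entry.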

I am...a little hesitant to run something on such a large number of files. What if it breaks, partway through? What if my laptop panics? I think, however, this is a good opportunity to take a break, make some hot chocolate, watch the snow fall, and let my code run (and then, maybe, return to writing?!)

### UH OH.

*revised plan* Okay, so I started it running, and then let it be. And, of course, something didn't work and it threw a "file not found" error. :( The weird thing is that it didn't happen on the `testparse = etree` line like I would have expected; instead, it happened on the first `for event, elem in testparse: testlist.append(elem.text)`.

So, I'm going to return to doing it in batches; at least, then, I will know whether it is happening in a certain folder. Grumble grumble.
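
A complementary option to batching is to catch the error for the one bad file, note its path, and keep going. The sketch below is an assumption, not something that was run on the Gale files: it uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs, the helper name `subjects_or_error` and the file path are hypothetical, and with lxml the exception types in the `except` clause may need adapting (lxml raises `etree.XMLSyntaxError` for malformed XML, for instance).

```python
import xml.etree.ElementTree as ET

def subjects_or_error(path):
    """Return (subjects, None) on success or (None, message) on failure."""
    try:
        subjects = [elem.text
                    for _, elem in ET.iterparse(path, events=('end',))
                    if elem.tag == 'locSubject']
        return subjects, None
    except (OSError, ET.ParseError) as err:
        # OSError covers a missing or unreadable file;
        # ParseError covers malformed XML.
        return None, f'{type(err).__name__}: {err}'

failed = []
for path in ['files/this-file-does-not-exist.xml']:  # hypothetical path
    subjects, error = subjects_or_error(path)
    if error is not None:
        failed.append((path, error))  # remember it and carry on
        continue
    # ... the LCSH check would go here ...

print(failed)
```

At the end of a long run, `failed` then shows exactly which files need a second look, instead of the whole loop stopping partway through.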

(side note: I am a little concerned about what if the capitalization happens to be different somewhere? But, I think I can just run a `locsubject` (without the capital S) to catch any differences...)
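
One way to sidestep the capitalization worry is to compare tag names in lowercase instead of passing an exact `tag=` filter. This is a minimal sketch on invented in-memory data, using the standard library's `xml.etree.ElementTree`; the helper name `texts_for_tag` is hypothetical. With lxml the same comparison should work for ordinary elements, though comments and processing instructions would need to be skipped first.

```python
import io
import xml.etree.ElementTree as ET

def texts_for_tag(source, wanted):
    """Collect the text of elements whose tag matches `wanted`, ignoring case."""
    wanted = wanted.lower()
    return [elem.text
            for _, elem in ET.iterparse(source, events=('end',))
            if elem.tag.lower() == wanted]

# Invented sample with two spellings of the same tag.
SAMPLE = b"<book><locSubject>History</locSubject><LocSubject>Travel</LocSubject></book>"
found = texts_for_tag(io.BytesIO(SAMPLE), 'locsubject')
print(found)
```

The trade-off is that every element is visited in Python rather than being filtered by the parser, so this may be slower than the `tag=` approach.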

In order to do a longer test, I'll use the `ECCOII HistAndGeo` subsection as an experiment - there were about 3300 files in there.  

_side note_: My laptop went to sleep just after the 2500 mark, so I had to start it back up again, woops. No duplicates in the total of `312` relevant files below, though I will have to filter out anything that isn't printed in Great Britain - there are some texts printed in New York, Dublin, etc., in there.

In [2]:
filelistHistAndGeo = glob.glob('D:/ECCOII 2001/HistAndGeo/XML/*.xml')

In [19]:
termlist = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']
taglist = ['documentID', 'ESTCID', 'pubDate',
           'language','module','locSubject','notes',
           'fullTitle','displayTitle','currentVolume', 
           'totalVolumes', 'imprintPublisher','imprintFull',
           'imprintCity', 'publicationPlace']

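# this list will hold one dict of metadata
# for each relevant file
# (left commented out here because this run resumes at file 2500
# and keeps appending to the list built before the laptop went to sleep)
# listofdicts = []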

# and, a count so that I can track my progress
count = 2500

for file in filelistHistAndGeo[2500:]:
    count+=1
    if (count % 500) == 0:
        print(count)
    
    
    # the first iterparse will test to see if it has the desired lcsh.
    testparse = etree.iterparse(file, events=('end',), tag = 'locSubject')
    testlist = []
    for event, elem in testparse:
        testlist.append(elem.text)
        # the below will reclaim the memory as each element is read -
        # clears unneeded node references
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        
    # and back to the purpose of our code - 
    # note that putting it in a string makes it easier to search
    # comparing list items required an exact match,
    # and I wanted fuzzier searching, 
    # just in case there were any errors in the controlled vocabulary
    # of the lcsh.
    teststring = ', '.join(testlist)     
    if any(lcsh in teststring for lcsh in termlist):
        
        # if that is true, we want the metadata for that file
        # so let's make a dict to hold it
        # this dict will be reset with every file loop
        filedict = {}
        
        # a dict entry to indicate which ECCO release it came from
        filedict['eccorelease'] = '2'
        
        for xmltag in taglist:
            # make an empty list to hold what is in each tag, 
            # which will be written to our dict in a few steps
            elementlist = []
            
            context = etree.iterparse(file, events=('end',), tag=xmltag)
            for event, elem in context:
                elementlist.append(elem.text)
                # the below should make things faster - I think? 
                # reclaim the memory as each element is read -
                # clears unneeded node references
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # assign to the dictionary
            filedict[xmltag] = ', '.join(elementlist)
            
        # after it has looped through all the xmltags, 
        # add the filedict to list of dicts 
        listofdicts.append(filedict)   
        


3000


In [20]:
len(listofdicts)

312

In [21]:
dfHistAndGeo = pd.DataFrame(listofdicts)
dfHistAndGeo

Unnamed: 0,ESTCID,currentVolume,displayTitle,documentID,eccorelease,fullTitle,imprintCity,imprintFull,imprintPublisher,language,locSubject,module,notes,pubDate,publicationPlace,totalVolumes
0,N025387,Volume 16,"The World displayed; Or, A curious collection ...",1309700116,2,"The World displayed; Or, A curious collection ...",London,"London : Printed for J. Newbery, at the Bible ...","Printed for J. Newbery, at the Bible and Sun, ...",English,Voyages and travels,History and Geography,,17670101,London,1
1,N031356,0,An essay towards a description of the city of ...,1309700600,2,An essay towards a description of the city of ...,[Bath],"[Bath] : Printed for W. Frederick, bookseller,...","Printed for W. Frederick, bookseller, in Bath",English,"Bath (England), Description and travel, Early ...",History and Geography,With an additional titlepage for pt. 1: 'An es...,17420101,Bath,0
2,N029227,0,"Interesting account of the early voyages, made...",1309900200,2,"Interesting account of the early voyages, made...",London,"London : Printed for the proprietors, and sold...","Printed for the proprietors, and sold at Stalk...",English,"Explorers, Portugal, Early works to 1800, Expl...",History and Geography,,17900101,London,0
3,N031489,Volume 1,An entertaining journey to the Netherlands; co...,1309900301,2,An entertaining journey to the Netherlands; co...,London,"London : printed for W. Smith, M DCC LXXXII. [...",printed for W. Smith,English,"Netherlands, Description and travel, Early wor...",History and Geography,Coriat Junior = Samuel Paterson?.,17820101,London,3
4,N031489,Volume 2,An entertaining journey to the Netherlands; co...,1309900302,2,An entertaining journey to the Netherlands; co...,London,"London : printed for W. Smith, M DCC LXXXII. [...",printed for W. Smith,English,"Netherlands, Description and travel, Early wor...",History and Geography,Coriat Junior = Samuel Paterson?.,17820101,London,3
5,N031489,Volume 3,An entertaining journey to the Netherlands; co...,1309900303,2,An entertaining journey to the Netherlands; co...,London,"London : printed for W. Smith, M DCC LXXXII. [...",printed for W. Smith,English,"Netherlands, Description and travel, Early wor...",History and Geography,Coriat Junior = Samuel Paterson?.,17820101,London,3
6,T188570,0,"A New collection of voyages and travels, dedic...",1309901100,2,"A New collection of voyages and travels, dedic...",London,"London : Printed for E. Newbery, the corner of...","Printed for E. Newbery, the corner of St. Paul...",English,"Voyages and travels, Early works to 1800",History and Geography,William Mavor is the editor of the 'Historical...,17960101,London,0
7,T172346,0,"Miscellaneous remarks made on the spot, in a l...",1310300400,2,"Miscellaneous remarks made on the spot, in a l...",London,"London : Printed for S. Hooper, at Gay's Head,...","Printed for S. Hooper, at Gay's Head, near Bea...",English,"Italy, Description and travel, Early works to ...",History and Geography,,17560101,London,0
8,T220401,Volume 1,"The memoirs of Charles-Lewis, Baron de Pollnit...",1313200301,2,"The memoirs of Charles-Lewis, Baron de Pollnit...",London,"London : Printed for Daniel Browne, at the Bla...","Printed for Daniel Browne, at the Black Swan, ...",English,"Europe, Description and travel",History and Geography,Translated by Stephen Whatley. In this editio...,17390101,London,2
9,T220401,Volume 2,"The memoirs of Charles-Lewis, Baron de Pollnit...",1313200302,2,"The memoirs of Charles-Lewis, Baron de Pollnit...",London,"London : Printed for Daniel Browne, at the Bla...","Printed for Daniel Browne, at the Black Swan, ...",English,"Europe, Description and travel",History and Geography,Translated by Stephen Whatley. In this editio...,17390101,London,2


In [22]:
dfHistAndGeo.to_csv('files/HistAndGeo.csv')

Okay, so, it worked (hurrah!) but it did take quite a while - a few hours. At this rate, I might have to run it one section at a time in order to gather them all, unless there is a faster method (which there almost undoubtedly is).
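
One candidate for a faster method: the loops above open each relevant file once for the subject check and then once more for every tag in `taglist`. The sketch below reads a file a single time and collects all the wanted tags in that one pass. It is an untimed assumption, not a measured improvement; it uses the standard library's `xml.etree.ElementTree` and invented sample records so that it runs as is, and the function name `collect_metadata` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

TERMS = ['travel', 'discov', 'explor', 'voyage', 'guide', 'antiquit']
TAGS = ['documentID', 'ESTCID', 'pubDate', 'language', 'module',
        'locSubject', 'notes', 'fullTitle', 'displayTitle',
        'currentVolume', 'totalVolumes', 'imprintPublisher',
        'imprintFull', 'imprintCity', 'publicationPlace']

def collect_metadata(source, tags=TAGS, terms=TERMS):
    """Read `source` once. Return {tag: joined text} if a locSubject
    heading contains one of `terms`, otherwise None."""
    found = {tag: [] for tag in tags}
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag in found and elem.text:
            found[elem.tag].append(elem.text)
        elem.clear()  # free the element's text and children once read
    subjects = ', '.join(found['locSubject'])
    if not any(term in subjects for term in terms):
        return None
    return {tag: ', '.join(values) for tag, values in found.items()}

# Invented sample records, not real ECCO data.
TRAVEL = b"""<book><documentID>0000000001</documentID>
<locSubject>Italy</locSubject><locSubject>Description and travel</locSubject>
<fullTitle>An invented title</fullTitle></book>"""
OTHER = b"<book><documentID>0000000002</documentID><locSubject>History</locSubject></book>"

travel_record = collect_metadata(io.BytesIO(TRAVEL))
other_record = collect_metadata(io.BytesIO(OTHER))
print(travel_record['locSubject'])
print(other_record)
```

The whole file still has to be read, so the saving (if any) comes from not re-reading it for each tag; it would need timing on real files before relying on it.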