#Using lxml and modsqual to parse and assess MODS XML
In this tutorial, we will learn the basics of parsing XML with the third-party [lxml](http://lxml.de/) library and assessing MODS XML metadata with [modsqual](https://github.com/saverkamp/modsqual), a module (in progress!) that simplifies lxml for working with MODS. 

###Installation
####lxml
If you installed the [Anaconda Python distribution](https://www.continuum.io/why-anaconda), good news! You already have lxml installed. If you aren't using Anaconda (and you already have Python and pip installed), you can install it with pip:

```
pip install lxml
```

More information on pip installation and other methods is available on the lxml [installation page](http://lxml.de/installation.html).

####modsqual
Modsqual requires lxml, so make sure you have that installed before attempting to work with the module. Modsqual also uses [xmltodict](https://github.com/martinblech/xmltodict), but this should be installed automatically with the modsqual install if you don't have it already. xmltodict is a module that converts XML to Python dictionaries (which map easily to JSON), if that's your thing.

Install modsqual with pip:
```
pip install modsqual
```

###lxml
Let's read in some MODS XML data so we can see if our lxml installation worked. You should have a file named 'sample_mods_data.txt' in this folder. If you don't, you can download it [here](http://github.com/saverkamp/measure-metadata-workshop/MODSXMLandPython/sample_mods_data.txt).

In [125]:
f = open('sample_mods_data.txt')

This text file contains ~9000 MODS XML records from NYPL's Digital Collections, with one record per line. You could instead nest all of the records within a single `<modsCollection>` element to make one well-formed XML document, but it can be difficult to load such a giant XML document into memory and work with it. To save processing time, we're going to cheat and iterate line by line through this text file.
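
Here's a minimal sketch of that line-by-line approach wrapped in a generator, so only one record sits in memory at a time. The `iter_records` name and the two tiny sample records are made up for illustration, and `io.StringIO` stands in for the real file.

```python
import io

def iter_records(handle):
    """Yield each non-empty line of an open file as one record string."""
    for line in handle:
        record = line.strip()
        if record:
            yield record

# Stand-in for sample_mods_data.txt: two made-up records, one per line,
# with a blank line in between to show that empty lines are skipped.
sample = io.StringIO(
    u'<mods><titleInfo><title>First</title></titleInfo></mods>\n'
    u'\n'
    u'<mods><titleInfo><title>Second</title></titleInfo></mods>\n'
)
records = list(iter_records(sample))
print(len(records))
```

With the real file you would pass the open file object instead of `sample` and process each record inside a `for` loop rather than building a list.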

Let's use the `readline()` method to read the first record/row into a variable called line. We'll print it, too, so we can see what we're working with:

In [126]:
line = f.readline()
print line

<?xml version="1.0" encoding="UTF-8"?><mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.4" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">  <titleInfo usage="primary" supplied="no">    <title>America Septentrionalis.</title>  </titleInfo>  <identifier type="local_hades_collection" displayLabel="Hades Collection Guide ID (legacy)">149</identifier>  <identifier type="local_hades" displayLabel="Hades struc ID (legacy)">252939</identifier>  <identifier type="local_catnyp" displayLabel="CATNYP ID (legacy)">b6987048</identifier>  <identifier type="local_other" displayLabel="RLIN/OCLC">48757232</identifier>  <identifier type="local_bnumber" displayLabel="NYPL catalog ID (B-number)">b15315142</identifier>  <location>    <physicalLocation authority="marcorg" type="repository">nn</physicalLocation>    <physicalLocation type="division">Lionel Pincus and Princess Firyal Map Division</physical

lxml has two modules for parsing XML, [etree](http://lxml.de/tutorial.html) and [objectify](http://lxml.de/objectify.html). etree provides access to XML structure and content through Python's native ElementTree API, while objectify provides more object-oriented access to your XML document. We're going to learn about the objectify module, but we will need to import both modules to use it:

In [127]:
from lxml import etree
from lxml import objectify

Let's use the `fromstring()` method to import our first line as an lxml.objectify object called `mods`:

In [128]:
mods = objectify.fromstring(line)
mods

<Element {http://www.loc.gov/mods/v3}mods at 0x423ed48>
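
A note if you're following along in Python 3 rather than the Python 2 used in these cells: there, `readline()` on a text-mode file returns a `str`, and lxml generally refuses `str` input that carries an `encoding` declaration like the one at the top of each record. Opening the file in binary mode (`open('sample_mods_data.txt', 'rb')`) sidesteps this. A small sketch with a shortened, hand-made record:

```python
from lxml import objectify

# A shortened, hand-made record. It is a bytes literal (not str) because
# it starts with an XML declaration that names an encoding.
record = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<mods xmlns="http://www.loc.gov/mods/v3" version="3.4">'
    b'<titleInfo usage="primary"><title>America Septentrionalis.</title></titleInfo>'
    b'</mods>'
)
parsed = objectify.fromstring(record)
print(parsed.tag)
print(parsed.titleInfo.title.text)
```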

We can see that mods is an Element object that holds our root element `<mods>`. What can we do with our element object? Let's see:

In [129]:
dir(mods)

['__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__hash__',
 '__init__',
 '__iter__',
 '__len__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_init',
 'addattr',
 'addnext',
 'addprevious',
 'append',
 'attrib',
 'base',
 'clear',
 'countchildren',
 'descendantpaths',
 'extend',
 'find',
 'findall',
 'findtext',
 'get',
 'getchildren',
 'getiterator',
 'getnext',
 'getparent',
 'getprevious',
 'getroottree',
 'identifier',
 'index',
 'insert',
 'items',
 'iter',
 'iterancestors',
 'iterchildren',
 'iterdescendants',
 'iterfind',
 'itersiblings',
 'itertext',
 'keys',
 'location',
 'makeelement',
 'name',
 'note',
 'nsmap',
 'originInfo',
 'physicalDescription',
 'prefix',
 'relatedItem',
 'remove',
 'replace',
 'set',
 'so

Wow! That's a lot of attributes. You could read the documentation to learn about what all of these do, but when I'm learning a new module, I like to just try things out. 

In [130]:
mods.tag

'{http://www.loc.gov/mods/v3}mods'

If an attribute is a method, evaluating it without parentheses will tell you so, and then you can call it by adding `()`:

In [131]:
mods.getchildren

<function getchildren>

In [132]:
mods.getchildren()

[<Element {http://www.loc.gov/mods/v3}titleInfo at 0x423c688>,
 149,
 252939,
 'b6987048',
 48757232,
 'b15315142',
 <Element {http://www.loc.gov/mods/v3}location at 0x426fd48>,
 <Element {http://www.loc.gov/mods/v3}location at 0x426f148>,
 <Element {http://www.loc.gov/mods/v3}name at 0x426fb48>,
 'still image',
 <Element {http://www.loc.gov/mods/v3}originInfo at 0x426f308>,
 <Element {http://www.loc.gov/mods/v3}physicalDescription at 0x426fdc8>,
 'California shown as an island.',
 'Includes decorative cartouche showing native North Americans and ill. of animals in the body of the map.',
 'National Endowment for the Humanities Grant for Access to Early Maps of the Middle Atlantic Seaboard.',
 'Relief shown pictorially.',
 'Title of the text on verso: Beschryvinge van het Noorder Deel van America ; signature: Bb.',
 'Koemen, C. Atlantes Neerlandici, II, p. 431',
 <Element {http://www.loc.gov/mods/v3}subject at 0x426ff48>,
 <Element {http://www.loc.gov/mods/v3}subject at 0x4289048>,
 '5d

To find out more about an attribute, use the help() function:

In [133]:
help(mods.getchildren)

Help on built-in function getchildren:

getchildren(...)
    getchildren(self)
    
    Returns a sequence of all direct children.  The elements are
    returned in document order.



We can see that `getchildren()` returned a list of its direct children. If a direct child contains a text value, it returns the text. If the child contains other elements, it returns the child as another objectify Element. (Looks like this method might be more useful for elements with less stuff!)

Use this next cell to explore some more attributes of the `mods` Element object.

You might have noticed the names of [top-level MODS elements](https://www.loc.gov/standards/mods/userguide/generalapp.html) amongst our `mods` object's attributes. This means you can access child elements (or "objects") through "dot" notation.

In [134]:
mods.titleInfo.title

'America Septentrionalis.'

In [135]:
mods.originInfo.dateIssued

1639

In [136]:
mods.identifier

149

Hmm, that's strange. We know that we have more than one identifier in our record, but we're only seeing one. When we call elements through dot notation, we're only given the text of the first instance of the element. We need to iterate through the attribute if we want to get all values or properties of each instance.

In [137]:
[i for i in mods.identifier]

[149,
 252939,
 'b6987048',
 48757232,
 'b15315142',
 '5db7ad80-c52a-012f-0a4c-3c075448cc4b']
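
If you want more than the values, each item in that iteration is still an element, so you can ask it for its XML attributes too. A small sketch with a cut-down record (three identifiers copied from the record above): `.get()` reads an attribute and `.text` gives the raw string before objectify guesses a type.

```python
from lxml import objectify

root = objectify.fromstring(
    b'<mods xmlns="http://www.loc.gov/mods/v3">'
    b'<identifier type="local_hades">252939</identifier>'
    b'<identifier type="local_catnyp">b6987048</identifier>'
    b'<identifier type="uuid">5db7ad80-c52a-012f-0a4c-3c075448cc4b</identifier>'
    b'</mods>'
)

# Iterating over root.identifier visits every <identifier> sibling.
pairs = [(ident.get('type'), ident.text) for ident in root.identifier]
for id_type, value in pairs:
    print('%s: %s' % (id_type, value))
```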

You can find specific values in an Element object using the `find()` or `findall()` methods:

In [91]:
help(mods.find)

Help on built-in function find:

find(...)
    find(self, path, namespaces=None)
    
    Finds the first matching subelement, by tag name or path.
    
    The optional ``namespaces`` argument accepts a
    prefix-to-namespace mapping that allows the usage of XPath
    prefixes in the path expression.



In [94]:
mods.tag

'{http://www.loc.gov/mods/v3}mods'

If you've worked with XSLT, you'll be happy to learn you can use xpath to find values. In this example we'll use an xpath to find the identifier of type "uuid". Note that because we've declared namespaces in the root element, we have to use them in our xpaths. We can map the full namespace to a prefix in the `namespaces` argument.

In [138]:
mods.xpath('m:identifier[@type="uuid"]', namespaces={'m':'http://www.loc.gov/mods/v3'})

['5db7ad80-c52a-012f-0a4c-3c075448cc4b']

Also note that it returns our results in a list (even if there's only one result). Let's try another. This xpath looks for the text of the title whose `usage` attribute is set to "primary":

In [139]:
mods.xpath('m:titleInfo[@usage="primary"]/m:title', namespaces={'m':'http://www.loc.gov/mods/v3'})

['America Septentrionalis.']
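
Typing the namespace mapping every time gets old fast. One option is to keep it in a variable and reuse it. The `NS` name and the trimmed record are ours, not part of lxml. The sketch also shows `text()` for pulling strings out directly and XPath's `count()` function, which returns a float.

```python
from lxml import objectify

# Reusable prefix-to-namespace mapping (the name NS is just a convention).
NS = {'m': 'http://www.loc.gov/mods/v3'}

root = objectify.fromstring(
    b'<mods xmlns="http://www.loc.gov/mods/v3">'
    b'<titleInfo usage="primary"><title>America Septentrionalis.</title></titleInfo>'
    b'<identifier type="local_hades">252939</identifier>'
    b'<identifier type="uuid">5db7ad80-c52a-012f-0a4c-3c075448cc4b</identifier>'
    b'</mods>'
)

# text() returns the strings themselves instead of elements.
uuid_values = root.xpath('m:identifier[@type="uuid"]/text()', namespaces=NS)
# count() is evaluated by XPath and comes back as a float.
identifier_count = root.xpath('count(m:identifier)', namespaces=NS)
# find() accepts the same mapping and returns the first match (or None).
primary_title = root.find('m:titleInfo[@usage="primary"]/m:title', namespaces=NS)

print(uuid_values)
print(identifier_count)
print(primary_title.text)
```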

Use this space to try a few more xpath searches. If you're new to xpath or need to brush up, [these examples](https://msdn.microsoft.com/en-us/library/ms256086) are helpful.

##modsqual

lxml is a very powerful tool for parsing XML, but I've found the learning curve to be a bit steep. It's also a lot of typing that I don't really want to do. So, I created a wrapper around lxml that was more intuitive to use and tailored to my purposes in working with MODS metadata, primarily quality assessment. This module makes it easy to answer simple quality-related questions about an element like:
- is the element there?
- how many instances of the element are there?

It also makes it easier to compose xpath queries and includes support for regex. It could still use a lot of work, so if you like what you see, consider contributing!

Let's try some of the same examples we used in our lxml exercise. First let's close and reopen the text file and read in the first line.

In [141]:
f.close()

f = open('sample_mods_data.txt')
line = f.readline()
print line

<?xml version="1.0" encoding="UTF-8"?><mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.4" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">  <titleInfo usage="primary" supplied="no">    <title>America Septentrionalis.</title>  </titleInfo>  <identifier type="local_hades_collection" displayLabel="Hades Collection Guide ID (legacy)">149</identifier>  <identifier type="local_hades" displayLabel="Hades struc ID (legacy)">252939</identifier>  <identifier type="local_catnyp" displayLabel="CATNYP ID (legacy)">b6987048</identifier>  <identifier type="local_other" displayLabel="RLIN/OCLC">48757232</identifier>  <identifier type="local_bnumber" displayLabel="NYPL catalog ID (B-number)">b15315142</identifier>  <location>    <physicalLocation authority="marcorg" type="repository">nn</physicalLocation>    <physicalLocation type="division">Lionel Pincus and Princess Firyal Map Division</physical

We first need to load in the modsqual module:

In [142]:
import modsqual

Now let's create a modsqual Mods object. This object will represent the whole MODS XML record.

In [144]:
mods = modsqual.Mods(line)

Let's see what we can do with our Mods object:

In [146]:
dir(mods)

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'abstract',
 'accessCondition',
 'classification',
 'counts',
 'extension',
 'genre',
 'getAbstract',
 'getAccessCondition',
 'getClassification',
 'getExtension',
 'getGenre',
 'getIdentifier',
 'getLanguage',
 'getLocation',
 'getName',
 'getNote',
 'getOriginInfo',
 'getPart',
 'getPhysicalDescription',
 'getRecordInfo',
 'getRelatedItem',
 'getSubject',
 'getTableOfContents',
 'getTargetAudience',
 'getTitleInfo',
 'getTypeOfResource',
 'identifier',
 'language',
 'location',
 'match',
 'mods',
 'name',
 'note',
 'originInfo',
 'part',
 'physicalDescription',
 'recordInfo',
 'relatedItem',
 'subject',
 'tableOfContents',
 'targetAudience',
 'titleInfo',
 'toplevels',
 'typeOfResource',
 'wf',
 'xml']

Here we can see that, like in the lxml example, our top-level MODS elements are listed as attributes. There are a few other attributes of our Mods object to note. 

We can see if our MODS XML is well-formed by calling the `wf` attribute:

In [147]:
mods.wf

True

We can get counts of each top-level element with the `counts` attribute:

In [148]:
mods.counts

{u'identifier': 6,
 u'location': 2,
 u'name': 1,
 u'note': 6,
 u'originInfo': 1,
 u'physicalDescription': 1,
 u'relatedItem': 1,
 u'subject': 2,
 u'titleInfo': 1,
 u'typeOfResource': 1}
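
For comparison, here's roughly how you could tally top-level elements yourself with `collections.Counter`. This is only a sketch of the idea, not necessarily how modsqual computes `counts`. The `local_name` helper and the trimmed record are made up for illustration, and the import falls back to the standard library's ElementTree if lxml isn't importable, since both share this part of the API.

```python
from collections import Counter

try:
    from lxml import etree as ET
except ImportError:
    # The standard library offers the same basic ElementTree calls.
    import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit('}', 1)[-1]

root = ET.fromstring(
    b'<mods xmlns="http://www.loc.gov/mods/v3">'
    b'<titleInfo><title>America Septentrionalis.</title></titleInfo>'
    b'<identifier type="local_hades">252939</identifier>'
    b'<identifier type="local_catnyp">b6987048</identifier>'
    b'<typeOfResource>still image</typeOfResource>'
    b'</mods>'
)

# Iterating over an etree element yields its direct children.
top_level_counts = Counter(local_name(child.tag) for child in root)
print(dict(top_level_counts))
```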

If we want to view the XML, we can call the `xml()` method:

In [152]:
mods.xml()

'<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.4" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">\n  <titleInfo usage="primary" supplied="no">\n    <title>America Septentrionalis.</title>\n  </titleInfo>\n  <identifier type="local_hades_collection" displayLabel="Hades Collection Guide ID (legacy)">149</identifier>\n  <identifier type="local_hades" displayLabel="Hades struc ID (legacy)">252939</identifier>\n  <identifier type="local_catnyp" displayLabel="CATNYP ID (legacy)">b6987048</identifier>\n  <identifier type="local_other" displayLabel="RLIN/OCLC">48757232</identifier>\n  <identifier type="local_bnumber" displayLabel="NYPL catalog ID (B-number)">b15315142</identifier>\n  <location>\n    <physicalLocation authority="marcorg" type="repository">nn</physicalLocation>\n    <physicalLocation type="division">Lionel Pincus and Princess Firyal Map Division</physicalLocation>\n    

What can we do with the top-level elements? 

In [154]:
dir(mods.titleInfo)

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'byattr',
 'count',
 'counts',
 'dict',
 'exists',
 'match',
 'mods',
 'name',
 'subels',
 'tag',
 'text',
 'xml']

You'll notice that we can't use dot notation for elements beyond the top-level element. (Pull requests welcome!) But we can do some other things.

Count the number of titleInfo elements:

In [157]:
mods.titleInfo.count

1

Check that a top-level element exists:

In [179]:
mods.genre.exists

False

View the innermost text of all titleInfo elements:

In [159]:
mods.titleInfo.text()

[['America Septentrionalis.']]

Notice that we have a list inside of a list. This is because there might be multiple titleInfo elements and multiple elements within each titleInfo element.

Let's view the text of location elements:

In [160]:
mods.location.text()

[['nn',
  'Lionel Pincus and Princess Firyal Map Division',
  'Map Division',
  'MAP'],
 ['Lionel Pincus and Princess Firyal Map Division',
  'Map Div. 01-5371 [Filed with North America, [1660] as originally cataloged in NYPL Dictionary Catalog of the Map Division]',
  'Map Division',
  'MAP']]

Let's view the XML of all the location elements:

In [162]:
mods.location.xml()

['<location xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n  <physicalLocation authority="marcorg" type="repository">nn</physicalLocation>\n  <physicalLocation type="division">Lionel Pincus and Princess Firyal Map Division</physicalLocation>\n  <physicalLocation type="division_short_name">Map Division</physicalLocation>\n  <physicalLocation type="code">MAP</physicalLocation>\n</location>\n',
 '<location xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n  <physicalLocation type="division">Lionel Pincus and Princess Firyal Map Division</physicalLocation>\n  <shelfLocator>Map Div. 01-5371 [Filed with North America, [1660] as originally cataloged in NYPL Dictionary Catalog of the Map Division]</shelfLocator>\n  <physicalLocation type="division_short_name">Map Division</physicalLocation>\n  <physicalLocation type="code">MAP</physicalLocation>\n</location>\n']

Note that this returns a list of XML fragments. 

We can also view elements as dictionaries. Let's view the location elements as dicts:

In [164]:
mods.location.dict

[OrderedDict([(u'location', OrderedDict([(u'@xmlns', u'http://www.loc.gov/mods/v3'), (u'@xmlns:xsi', u'http://www.w3.org/2001/XMLSchema-instance'), (u'physicalLocation', [OrderedDict([(u'@authority', u'marcorg'), (u'@type', u'repository'), ('#text', u'nn')]), OrderedDict([(u'@type', u'division'), ('#text', u'Lionel Pincus and Princess Firyal Map Division')]), OrderedDict([(u'@type', u'division_short_name'), ('#text', u'Map Division')]), OrderedDict([(u'@type', u'code'), ('#text', u'MAP')])])]))]),
 OrderedDict([(u'location', OrderedDict([(u'@xmlns', u'http://www.loc.gov/mods/v3'), (u'@xmlns:xsi', u'http://www.w3.org/2001/XMLSchema-instance'), (u'physicalLocation', [OrderedDict([(u'@type', u'division'), ('#text', u'Lionel Pincus and Princess Firyal Map Division')]), OrderedDict([(u'@type', u'division_short_name'), ('#text', u'Map Division')]), OrderedDict([(u'@type', u'code'), ('#text', u'MAP')])]), (u'shelfLocator', u'Map Div. 01-5371 [Filed with North America, [1660] as originally cat

Note that these are actually OrderedDicts, so that we preserve element sequence. As you all know, this is important for certain elements, like subjects. This looks a little cluttered to the human eye, with all of those namespaces (pull requests welcome!), but if you know MODS structure by heart, you won't have to look at it much.

Get the text of the first location's physicalLocation:

In [178]:
locationdict = mods.location.dict
locationdict[0]['location']['physicalLocation'][0]['#text']

u'nn'
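
One thing to watch for with xmltodict-style output: a child that appears once comes back as a single value, while a repeated child comes back as a list (compare `shelfLocator` and `physicalLocation` above). A small hypothetical helper, `as_list`, can smooth that over. The dict here is hand-built and shortened to mimic the shape of the output above.

```python
from collections import OrderedDict

def as_list(value):
    """Wrap a single child in a list so single and repeated elements look alike."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Hand-built, shortened stand-in for one item of mods.location.dict.
location = OrderedDict([('location', OrderedDict([
    ('physicalLocation', [
        OrderedDict([('@type', 'repository'), ('#text', 'nn')]),
        OrderedDict([('@type', 'division'),
                     ('#text', 'Lionel Pincus and Princess Firyal Map Division')]),
    ]),
    ('shelfLocator', 'Map Div. 01-5371'),
]))])

inner = location['location']
divisions = [p['#text'] for p in as_list(inner.get('physicalLocation'))
             if p.get('@type') == 'division']
shelf_locators = as_list(inner.get('shelfLocator'))
print(divisions)
print(shelf_locators)
```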

This is still a bit cumbersome, so let's try trusty old xpath. The `match()` method lets you return matching elements by attribute value, regular expression, or xpath. It is a wrapper around lxml's `etree.XPath` and returns a list of xpath results. Depending on your query, this will be either a list of text values or a list of elements. You can then convert element results to Python dictionaries using the `todict()` function, if you like.

You do still have to preface element names with the namespace prefix 'm:' and you need to start your xpath with mods as the root (pull requests welcome!). Let's look for the text of the physicalLocation with attribute @type="division":

In [180]:
mods.location.match(xpath='./m:location/m:physicalLocation[@type="division"]')

['Lionel Pincus and Princess Firyal Map Division',
 'Lionel Pincus and Princess Firyal Map Division']

Let's try a more complex example. Let's see if we have at least one of the following date types: dateCreated, dateIssued, or copyrightDate:

In [181]:
mods.originInfo.match(xpath='./m:originInfo/m:dateCreated|./m:originInfo/m:dateIssued|./m:originInfo/m:copyrightDate')

[1639]
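
Since `match()` wraps `etree.XPath`, here's a rough idea of what the same union query looks like in plain lxml. Compiling the expression once with `etree.XPath` lets you reuse it for every record. The `find_dates` name and the trimmed record are assumptions for illustration, and this is not modsqual's actual implementation.

```python
from lxml import etree

NS = {'m': 'http://www.loc.gov/mods/v3'}

# Compile the union expression once; the result is callable on any element.
find_dates = etree.XPath(
    'm:originInfo/m:dateCreated'
    ' | m:originInfo/m:dateIssued'
    ' | m:originInfo/m:copyrightDate',
    namespaces=NS,
)

root = etree.fromstring(
    b'<mods xmlns="http://www.loc.gov/mods/v3">'
    b'<originInfo><dateIssued>1639</dateIssued></originInfo>'
    b'</mods>'
)

dates = find_dates(root)            # list of matching elements
has_date = len(dates) > 0           # True if any of the three date types is present
date_texts = [d.text for d in dates]
print(has_date)
print(date_texts)
```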

Try a few xpaths in the cell below:

We can also match with regex on text values of elements. Let's find any identifiers that start with 4 digits:

In [182]:
mods.identifier.match(regex='^[0-9]{4,}')

[252939, 48757232]
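
To see what that pattern is doing, here's the same test with Python's `re` module on the identifier values from our record, treated as plain strings. This is a sketch of the matching logic only; how modsqual applies the regex internally may differ.

```python
import re

# The six identifier values from the first record, as strings.
identifiers = ['149', '252939', 'b6987048', '48757232', 'b15315142',
               '5db7ad80-c52a-012f-0a4c-3c075448cc4b']

# ^ anchors at the start; [0-9]{4,} requires four or more digits in a row.
pattern = re.compile(r'^[0-9]{4,}')
matches = [value for value in identifiers if pattern.search(value)]
print(matches)  # ['252939', '48757232']
```

'149' is too short, the 'b' identifiers don't start with a digit, and the uuid has only one digit before its first letter.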

Sadly, this only works on top-level text elements right now:

In [183]:
mods.titleInfo.match(regex='^A')

[]

##Using modsqual for a baseline audit

Let's put all this together and walk through a script that iterates through our ~9000 records and tests for the following:
- At least one titleInfo is present
- At least one typeOfResource and all match a valid value
- At least one of dateCreated, dateIssued, or copyrightDate is present
- At least one genre is present
- At least one identifier of type "local_bnumber", "local_mss", or "local_tms" is present
- At least one physicalLocation of type "division" is present

We'll score each factor as 1 for True, and 0 for False. Then we'll add the six scores to create a total score.
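
As a quick illustration of the arithmetic, with made-up pass/fail results for a single record (the `checks` list is hypothetical):

```python
# Hypothetical results for one record, one entry per test.
checks = [
    ('title', True),
    ('typeOfResource', True),
    ('date', True),
    ('genre', False),
    ('identifier', True),
    ('location', True),
]

# int(True) is 1 and int(False) is 0, so the total is the number of passed tests.
scores = {name: int(passed) for name, passed in checks}
total = sum(scores.values())
print(total)  # 5
```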

We also need to collect some basic identifying information about each record, as well as some information that will help in grouping or filtering in a later analysis. 
- Record UUID (identifier[@type="uuid"])
- Parent collection name (innermost nested relatedItem/titleInfo/title)
- Division name (location/physicalLocation[@type="division"])

You can find the entire script on its own in this directory or [here](http://github.com/saverkamp/measure-metadata-workshop/MODSXMLandPython/sample_baseline_scoring.py).

First let's import the necessary Python modules:

In [186]:
import modsqual
import csv

Now let's create a list of the valid MODS resource types that we'll use for checking typeOfResource against:

In [187]:
resource_types = ["text", "cartographic", "notated music", "sound recording-musical", "sound recording-nonmusical",
              "sound recording", "still image", "moving image", "three dimensional object", "software, multimedia",
              "mixed material"]

These are some functions I've written to help us with our tests and to simplify generating scores for each test.

In [188]:
def score(result, point=True, value=1):
    """assign points if result matches the point argument value, else 0"""
    return value if result == point else 0

def exists(element):
    """score 1 if the top-level element exists, else 0"""
    return score(element.exists)

def inList(element, vocabulary):
    """score 1 if the element exists and every value is in the controlled vocabulary"""
    if element.exists:
        return score(all(i in vocabulary for i in element.text()))
    return 0

def xpathexists(match, minimum=1):
    """score 1 if the result of an xpath match contains at least `minimum` matches"""
    try:
        return score(len(match) >= minimum)
    except TypeError:
        # match has no length (for example, None), so count it as a failed test
        return 0

Using the csv module, let's set up an empty csv file that we'll use to write our results to. Let's also write a header row for the data we're collecting.

In [189]:
z = open('sample_mods_scores.csv', 'wb')
header = ['uuid', 'division', 'collection', 'title', 'typeOfResource', 'genre', 'date', 'identifier', 'location', 'total']
writer = csv.DictWriter(z, fieldnames=header)
writer.writeheader()
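
If you're on Python 3, the csv module wants a text-mode file opened with `newline=''` instead of `'wb'`. Here's a hedged sketch of the same setup; `io.StringIO` stands in for the real file so you can see what gets written, and the row values are sample data.

```python
import csv
import io

header = ['uuid', 'division', 'collection', 'title', 'typeOfResource',
          'genre', 'date', 'identifier', 'location', 'total']

# Stand-in for: open('sample_mods_scores.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()

# Sample row; keys must match the header labels exactly.
writer.writerow({
    'uuid': 'd7cac480-c52a-012f-b006-3c075448cc4b',
    'division': 'Lionel Pincus and Princess Firyal Map Division',
    'collection': 'Maps of North America.',
    'title': 1, 'typeOfResource': 1, 'genre': 0, 'date': 1,
    'identifier': 1, 'location': 1, 'total': 5,
})

lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[1])
```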

Make sure our MODS text file is open:

In [190]:
f = open('sample_mods_data.txt')

The next part of the script involves iterating through each record, so it needs to be in its own code block. Before we run that, let's unpack what we'll be doing in each record.

Read in the first line, as we did earlier, to walk through these next examples:

In [192]:
line = f.readline()
mods = modsqual.Mods(line)

First we'll find the record's uuid identifier using the `match()` method. You can also match elements with the `attr` argument: pass the attributes and their values as a list. `match()` returns a list, so we'll take the first result with index `[0]`:

In [193]:
uuids = mods.identifier.match(attr=['@type="uuid"'])
uuid = uuids[0]
print uuid

d7cac480-c52a-012f-b006-3c075448cc4b


To get the collection name, we need the innermost relatedItem element. We could write a function to do this the proper way, or we could just assume that if we write an xpath to return all relatedItem elements, the innermost one will be the last. After we make sure that we actually have any relatedItem elements, we'll use the index `[-1]` to get the last item in the list:

In [194]:
collections = mods.relatedItem.match(xpath='.//m:relatedItem/m:titleInfo/m:title/text()')
if len(collections) > 0:
    coll_name = collections[-1]
else:
    coll_name = 'Null'
print coll_name

Maps of North America.
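
For the curious, here's one way the "proper" function could look: walk down through nested `relatedItem` elements until there are no more, then read that element's title. The function name and the outer "Atlas volume" title are made up. The sketch follows only the first `relatedItem` at each level, and it falls back to the standard library's ElementTree if lxml isn't importable.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

NS = {'m': 'http://www.loc.gov/mods/v3'}

def innermost_related_title(root):
    """Return the title of the most deeply nested relatedItem, or None."""
    node = root.find('m:relatedItem', NS)
    if node is None:
        return None
    # Keep descending while the current relatedItem contains another one.
    while True:
        child = node.find('m:relatedItem', NS)
        if child is None:
            break
        node = child
    return node.findtext('m:titleInfo/m:title', namespaces=NS)

nested = ET.fromstring(
    b'<mods xmlns="http://www.loc.gov/mods/v3">'
    b'<relatedItem type="host"><titleInfo><title>Atlas volume</title></titleInfo>'
    b'<relatedItem type="host"><titleInfo><title>Maps of North America.</title></titleInfo>'
    b'</relatedItem></relatedItem></mods>'
)
no_related = ET.fromstring(b'<mods xmlns="http://www.loc.gov/mods/v3"/>')

coll_title = innermost_related_title(nested)
print(coll_title)
print(innermost_related_title(no_related))
```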


Now let's run each of our 6 tests. For titleInfo and genre, we are just checking to see if these elements exist, which we know we can do with the `exists` attribute:

In [198]:
print mods.titleInfo.exists
print mods.genre.exists

True
False


And then we need to convert those booleans to scores. The `exists()` function (above, in the list of functions) saves us some typing by checking the `exists` attribute, then running the value through the `score()` function to give us a score, 1 or 0, for the test:

In [208]:
title = exists(mods.titleInfo)
genre = exists(mods.genre)
print title
print genre

1
0


For typeOfResource, we need to get the values of the element, then check each value against our list of valid resource types. If all values match, it scores a 1; if not all match, or there aren't any typeOfResource elements present, it scores a 0. The `inList()` function simplifies this test by checking first that the element exists (`mods.typeOfResource.exists`), then checking that each value is represented in the `resource_types` list. It uses the `score()` function to score the test.

In [202]:
#pass the typeOfResource element and the resource types list to the inList() function to get the score
typeOfResource = inList(mods.typeOfResource, resource_types)
print typeOfResource

1


For date, location, and identifier, since we are matching on specific criteria, we need a different test than `exists()`. The `xpathexists()` function takes the results of an xpath `match()` call, counts the number of matches (the length of the list that `match()` returns), and uses `score()` to calculate the score.

In [203]:
date = xpathexists(mods.originInfo.match(xpath='./m:originInfo/m:dateCreated|./m:originInfo/m:dateIssued|./m:originInfo/m:copyrightDate'))
identifier = xpathexists(mods.identifier.match(xpath='./m:identifier[@type="local_bnumber" or @type="local_mss" or @type="local_tms"]'))
location = xpathexists(mods.location.match(xpath='./m:location/m:physicalLocation[@type="division"]'))
print date
print identifier
print location

1
1
1


We also wanted to collect the division name to use for filtering or grouping in a later analysis of these scores. Since we already ran our test on division location, we can use that to help get the division name. If that test fails, we want to call our division 'Null'. If it passed, we can use the same xpath match to get the division name. Note that we added an index for the first result, since we want the value of division to be a string, not a list (also that xpath might match more than one division!):

In [206]:
if (location == 1):
    division = mods.location.match(xpath='./m:location/m:physicalLocation[@type="division"]')[0]
else:
    division = 'Null'
print division

Lionel Pincus and Princess Firyal Map Division


Finally, let's calculate our total score. We'll put all of our test scores in a list and then use the `sum()` function to add them.

In [209]:
scores = [title, typeOfResource, genre, date, identifier, location]
total = sum(scores)
print total

5


We're ready to write our data to a row in the csv file. We're using the csv DictWriter, so we need to create a dictionary with the header labels as keys and our variables as values:

In [211]:
row = {'uuid':uuid, 'division':division, 'collection':coll_name, 'title':title, 'typeOfResource':typeOfResource, 'genre':genre,
       'date':date, 'identifier':identifier, 'location':location, 'total':total}
print row

{'division': 'Lionel Pincus and Princess Firyal Map Division', 'total': 5, 'date': 1, 'uuid': 'd7cac480-c52a-012f-b006-3c075448cc4b', 'title': 1, 'genre': 0, 'identifier': 1, 'typeOfResource': 1, 'collection': 'Maps of North America.', 'location': 1}


We'll write the row to the file using `writer.writerow(row)` but we won't do that right now.

We're just about ready to loop through all of our records! One thing to notice in the next block is the error handling. It's possible our script might encounter a malformed XML record, in which case it's not going to be able to do any of the things we just looked at. The `try` clause tries running the code below it and if it fails, it runs the `except` clause, which tells the script what to do with this failure. Here we're telling the script to write the line index number (`idx`) to a text file called 'failed.txt' so we can go back and see what failed later, if we want.
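
Here's the shape of that pattern in miniature. To keep it dependency-free, this sketch parses with the standard library's ElementTree and catches its `ParseError` specifically (lxml's counterpart is `etree.XMLSyntaxError`). The three records are made up, with the second deliberately malformed, and `io.StringIO` stands in for 'failed.txt'.

```python
import io
import xml.etree.ElementTree as ET

# Made-up records; the second has a mismatched closing tag.
lines = [
    '<mods><titleInfo><title>Good record</title></titleInfo></mods>',
    '<mods><titleInfo><title>Broken record</titleInfo></mods>',
    '<mods><titleInfo><title>Another good record</title></titleInfo></mods>',
]

failed = io.StringIO()   # stand-in for the failed.txt log file
parsed_titles = []

for idx, line in enumerate(lines):
    try:
        root = ET.fromstring(line)
    except ET.ParseError:
        # Record the line index of the malformed record and move on.
        failed.write(u'%d\n' % idx)
        continue
    parsed_titles.append(root.findtext('titleInfo/title'))

print(parsed_titles)
print(failed.getvalue())
```

Catching one specific exception keeps other mistakes (a typo in a variable name, say) from being silently logged as "failed records".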

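Time to iterate! This could take a few minutes. Note that the `readline()` call above already consumed a line of `f`, so reopen the file first if you want the loop to start from the first record.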

In [217]:
log = open('failed.txt', 'wb')

for idx, line in enumerate(f):
    try:
        mods = modsqual.Mods(line)
        #Get uuid identifier
        uuids = mods.identifier.match(attr=['@type="uuid"'])
        uuid = uuids[0]
        #Get collection name. Collection is the most deeply nested relatedItem, so get all descendents and take the last in the list.
        collections = mods.relatedItem.match(xpath='.//m:relatedItem/m:titleInfo/m:title/text()')
        if len(collections) > 0:
            coll_name = collections[-1]
        else:
            coll_name = 'Null'
        #Test true/false (1/0) if title exists. Use the exists() function to calculate score
        title = exists(mods.titleInfo)
        typeOfResource = inList(mods.typeOfResource, resource_types)
        genre = exists(mods.genre)
        date = xpathexists(mods.originInfo.match(xpath='./m:originInfo/m:dateCreated|./m:originInfo/m:dateIssued|./m:originInfo/m:copyrightDate'))
        identifier = xpathexists(mods.identifier.match(xpath='./m:identifier[@type="local_bnumber" or @type="local_mss" or @type="local_tms"]'))
        location = xpathexists(mods.location.match(xpath='./m:location/m:physicalLocation[@type="division"]'))
        #Get curatorial division name. First check if division location is present, then take the first one listed. If none, then Null.
        if (location == 1):
            division = mods.location.match(xpath='./m:location/m:physicalLocation[@type="division"]')[0]
        else:
            division = 'Null'
        scores = [title, typeOfResource, genre, date, identifier, location]
        total = sum(scores)
        row = {'uuid':uuid, 'division':division, 'collection':coll_name, 'title':title, 'typeOfResource':typeOfResource, 'genre':genre,
               'date':date, 'identifier':identifier, 'location':location, 'total':total}
        writer.writerow(row)
    except Exception:
        log.write(str(idx))
        log.write('\n')

Finally, let's close our new files:

In [220]:
log.close()
z.close()

You should now have a csv file with ~9000 rows of baseline scores in your folder! You can follow Sara's [tutorial](https://github.com/saverkamp/measure-metadata-workshop/tree/master/pandas) to learn how to analyze this data with [pandas](http://pandas.pydata.org/).