# Resolve Metadata to HTRC IDs

This notebook is a part of work being done for [The Trace of Theory project](https://www.hathitrust.org/htrc_acs_awards_spring2015), a collaboration between researchers of [NovelTM](http://novel-tm.ca/) and the HathiTrust Research Center ([HTRC](https://www.hathitrust.org/)).

In particular, we want to use both [supervised](https://en.wikipedia.org/wiki/Supervised_learning) and [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning) machine learning techniques on HTRC texts to gain a better understanding of the extent and nature of theory in various genres.

A step along the way has been to develop lists of theoretical texts in both philosophical and literary genres (about 100 texts in each list). We've created lists of authors and titles, but we now need a way to match these against specific items in the HTRC corpus. To accomplish this, we'll create a very simple mechanism that produces a query from the author and title fields that can be used with the HTRC API (see [HTRC Technical Documents](https://wiki.htrc.illinois.edu/display/COM/HTRC+Technical+Documents) for more info).

This is a summary of what we'll do:

* define a function to produce a Solr query from metadata
* define a function to retrieve candidate IDs from the HTRC API using our query
* define a function to add a column to a table that already defines author and title values
* use these functions to resolve HTRC IDs in our list of philosophical and literary texts

## Produce Solr-compatible Query from Metadata

In order to prepare a query for the HTRC API we need to look at the existing metadata and treat different fields differently. For instance, authors can come in various forms (such as "Atwood, Margaret", "Margaret Atwood", "M. Atwood"), so we'll create a query component that requires every name part that is present, in any order, and skips single-letter initials. Word order does matter for titles, so we'll make a phrase from the existing title string (replacing runs of non-alphanumeric characters and extra whitespace with a single space, since the HTRC query engine will likely do that anyway).

In [120]:
import re

# This takes metadata passed as arguments and produces a Solr-compatible query parameter.
# Though we're only using author and title at the moment, we're defining this flexibly to
# allow for further functionality.
def get_solr_query(**kwargs):
    
    atoms = [] # keep track of the parts of our query
    
    if 'author' in kwargs: # look for author metadata
        # raw strings keep '\W' from being read as an invalid string escape
        atoms += ["author:"+word for word in re.split(r'\W+', kwargs['author']) if len(word) > 1]
    
    if 'title' in kwargs: # look for title metadata
        atoms.append('title:"'+re.sub(r'\W+', ' ', kwargs['title']).strip()+'"')

    return ' AND '.join(atoms)
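
To make the behaviour concrete, here is a small, self-contained sketch that mirrors the logic of `get_solr_query` (under the hypothetical name `build_query`, so it runs on its own) and applies it to a few entries from our lists. It only illustrates what the query strings look like; it says nothing about how the HTRC index tokenizes them.

```python
import re

def build_query(**kwargs):
    """Hypothetical stand-alone copy of get_solr_query, for illustration only."""
    atoms = []
    if 'author' in kwargs:
        # every name part longer than one character becomes a required author term
        atoms += ["author:" + word
                  for word in re.split(r'\W+', kwargs['author']) if len(word) > 1]
    if 'title' in kwargs:
        # runs of punctuation or whitespace collapse to one space; the title becomes a phrase
        atoms.append('title:"' + re.sub(r'\W+', ' ', kwargs['title']).strip() + '"')
    return ' AND '.join(atoms)

examples = [
    {'author': 'Plato', 'title': 'The Republic'},
    {'author': 'Nicolas Boileau-Despreaux', 'title': 'The Art of Poetry'},
    {'author': 'Horace', 'title': 'Ars Poetica / Art of Poetry'},
    {'author': 'M. Atwood'},  # the single-letter initial is dropped
]
queries = [build_query(**example) for example in examples]
for query in queries:
    print(query)
```

Note that the slash in the Horace title is folded away, leaving the single phrase "Ars Poetica Art of Poetry". A phrase like that seems unlikely to match a catalogue title exactly, which may be one reason that row comes back without IDs in the results further down.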

## Fetch HTRC IDs from a Query

Now that we can produce a Solr-compatible query, let's define a function that calls the HTRC API with the query and extracts the desired IDs from the response. It's worth noting that a query might return zero, one or many results (up to 10, based on the default limit).
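
As a rough illustration of the two halves of that function, the sketch below URL-encodes a query and then extracts IDs from a response. The XML is hand-written in the general shape of a Solr XML response, not captured from the HTRC service, and `example.org` stands in for the real endpoint. `rows` is the standard Solr parameter for changing the default limit of 10; the value 25 is arbitrary.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# placeholder endpoint (assumption), NOT the real HTRC URL
example_base_url = 'http://example.org/solr/meta/select/?'

# urlencode escapes the colons, quotes and spaces in the query for us
params = {'q': 'author:Plato AND title:"The Republic"', 'rows': 25}
url = example_base_url + urllib.parse.urlencode(params)
print(url)

# hand-written sample (assumption): each hit is a <doc> holding named fields
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">hvd.32044077696110</str><str name="title">The Republic</str></doc>
    <doc><str name="id">nyp.33433081632998</str></doc>
  </result>
</response>"""

# keep only the text of the <str name="id"> elements, wherever they occur
root = ET.fromstring(sample_xml)
ids = [element.text for element in root.findall(".//str[@name='id']")]
print(ids)
```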

In [121]:
import urllib.parse # imported explicitly because we call urllib.parse.urlencode below
import urllib.request
import xml.etree.ElementTree as ET

base_url = 'http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?' # sandbox URL

# This takes metadata passed as arguments and tries to resolve it to a list of IDs.
def get_ids(**kwargs):
    
    # fetch contents (no real error checking at the moment)
    url = base_url + urllib.parse.urlencode({'q': get_solr_query(**kwargs)})
    response = urllib.request.urlopen(url)
    xml = response.read()
    
    # parse XML contents
    root = ET.fromstring(xml)
    return [element.text for element in root.findall(".//str[@name='id']")]
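
The comment in the function above notes that there is no real error checking. One possible way to harden the call, sketched here under assumptions, is to add a timeout and return an empty list when the request or the XML parsing fails. The name `fetch_ids` and its `opener` parameter are hypothetical; the parameter exists only so the function can be exercised without touching the network.

```python
import io
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def fetch_ids(query, base_url, timeout=10, opener=urllib.request.urlopen):
    """Hypothetical hardened variant of get_ids: returns [] on network or parse errors."""
    url = base_url + urllib.parse.urlencode({'q': query})
    try:
        with opener(url, timeout=timeout) as response:
            xml = response.read()
        root = ET.fromstring(xml)
    except (OSError, ET.ParseError) as error:
        # OSError covers urllib.error.URLError as well as socket timeouts
        print('lookup failed for', query, '->', error)
        return []
    return [element.text for element in root.findall(".//str[@name='id']")]

# stand-in openers for an offline demonstration (no request is actually sent)
def fake_ok(url, timeout=None):
    return io.BytesIO(
        b'<response><result><doc><str name="id">hvd.hn3kts</str></doc></result></response>')

def fake_down(url, timeout=None):
    raise OSError('service unreachable')

placeholder_url = 'http://example.org/select/?'  # assumption, not the HTRC endpoint
found = fetch_ids('author:Plato AND title:"Theaetetus"', placeholder_url, opener=fake_ok)
missing = fetch_ids('author:Plato AND title:"Theaetetus"', placeholder_url, opener=fake_down)
print(found, missing)
```

Returning an empty list on failure keeps the table-building step running, at the cost of making a failed lookup look the same as a query with no matches; logging the error, as the sketch does with `print`, is one way to tell the two apart.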

## Append ID Column to a Metadata Table

Now that we have a function to retrieve HTRC IDs from metadata, all that's left is to create a function that enhances an existing metadata table by adding a column for the IDs. For the sake of convenience we'll rely on [Pandas](http://pandas.pydata.org/) to manage our data tables.

In [122]:
import pandas as pd

# Take an existing Pandas DataFrame with metadata and add a column with IDs
def append_ids(dataFrame):
    dataFrame['ids'] = '' # start with an empty string column
    for index, row in dataFrame.iterrows():
        # columns other than author and title are passed along but ignored by the query builder
        ids = get_ids(**row.to_dict())
        dataFrame.at[index, 'ids'] = ';'.join(ids) # .ix was removed from pandas; .at sets one cell
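
An alternative sketch, not the approach used in this notebook, builds the new column with `DataFrame.apply` instead of writing cell by cell inside a loop. The `resolver` argument and the `fake_resolver` stub are hypothetical, and the IDs are made up; in the notebook the resolver would be `get_ids`.

```python
import pandas as pd

def append_ids_with(dataFrame, resolver):
    """Hypothetical variant of append_ids.

    resolver(author=..., title=...) must return a list of ID strings.
    """
    # apply with axis=1 calls the lambda once per row and collects the results in a Series
    dataFrame['ids'] = dataFrame.apply(
        lambda row: ';'.join(resolver(author=row['author'], title=row['title'])),
        axis=1,
    )
    return dataFrame

# stub resolver so the example runs offline; the IDs are placeholders
def fake_resolver(author, title):
    known = {('Plato', 'Apology'): ['example.001', 'example.002']}
    return known.get((author, title), [])

table = pd.DataFrame({
    'author': ['Plato', 'Aristotle'],
    'title': ['Apology', 'The Categories'],
})
append_ids_with(table, fake_resolver)
print(table)
```

Passing the resolver in as an argument also makes it easy to test the table logic without calling the HTRC API.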

## Enhance Lists of Philosophy and Literary Texts

Everything is ready to use. We'll read in each of the existing metadata files and write out a copy with the IDs added. The source data files were exported from Excel as tab-separated values (files provided by Geoffrey and Laura with slight edits to the header row).

In [123]:
import pandas as pd
from IPython.display import HTML, display # public import path; IPython.core.display is deprecated for this

# read in source metadata file and output enhanced metadata file
def enhance_list(file_base, extension="tsv"):
    dataFrame = pd.read_table(file_base+"."+extension)
    append_ids(dataFrame)
    dataFrame.to_csv(file_base+"-with-ids."+extension, index=False, sep="\t")
    display(HTML(dataFrame.to_html()))

### Philosophy List

In [124]:
enhance_list("philosophy-list")

| Unnamed: 0 | author | title | ids |
| --- | --- | --- | --- |
| 0 | Plato | The Republic | hvd.32044077696110;nyp.33433081632998;mdp.3901... |
| 1 | Plato | Theaetetus | hvd.hn3kts;hvd.32044010441350;uc2.ark:/13960/t... |
| 2 | Plato | Apology | uc1.31822009493982;mdp.39015001808750;loc.ark:... |
| 3 | Plato | Phaedrus | hvd.hxjfcr;mdp.39015004072206;uc1.c072610589;m... |
| 4 | Plato | Euthyphro | uc1.b3922458;uc2.ark:/13960/t1sf2x49r;uc1.$b68... |
| 5 | Aristotle | Politics | uiug.30112055523101;mdp.39015066423750;mdp.390... |
| 6 | Aristotle | The Categories | |
| 7 | Aristotle | Poetics | hvd.32044010427722;hvd.32044012915443;uc2.ark:... |
| 8 | Aristotle | Ethics | uc2.ark:/13960/t9q23t861;mdp.39015065617824;md... |
| 9 | Aristotle | The Nicomachean Ethics | uc1.32106000024569;nnc1.50207993;uc1.a00036588... |


### Literary List

In [126]:
enhance_list("literary-list")

| Unnamed: 0 | author | title | ids |
| --- | --- | --- | --- |
| 0 | Aristotle | Poetics | hvd.32044010427722;hvd.32044012915443;uc2.ark:... |
| 1 | Aristotle | Rhetoric | mdp.39015005479871;mdp.39015051095548;njp.3210... |
| 2 | Horace | Ars Poetica / Art of Poetry | |
| 3 | Longinus | On the Sublime | umn.31951001668292k;mdp.39015004928894;hvd.320... |
| 4 | Philip Sidney | An Apology for Poetry | mdp.39015011889204;gri.ark:/13960/t87h3t70j;hv... |
| 5 | Pierre Corneille | Of the Three Unities | |
| 6 | John Dryden | An Essay of Dramatic Poesy | uc1.$b272452;mdp.39015030940434;mdp.3901503123... |
| 7 | Nicolas Boileau-Despreaux | The Art of Poetry | mdp.39015059894330;mdp.39015008498340;njp.3210... |
| 8 | John Dennis | The Advancement and Reformation of Modern Poetry | |
| 9 | Alexander Pope | An Essay on Criticism | pst.000005296095;mdp.39015000549983;nyp.334330... |
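
Since a query can return zero, one or many IDs, it may help to summarise how each row resolved before reviewing the candidates by hand. The following is a minimal sketch; the function name `summarise_ids` is hypothetical and the sample rows use placeholder IDs rather than the real output above.

```python
import pandas as pd

def summarise_ids(dataFrame):
    """Count rows with no, one, or several IDs in the semicolon-separated 'ids' column."""
    # empty cells come back as NaN when a saved file is re-read, so normalise them first
    counts = dataFrame['ids'].fillna('').apply(
        lambda ids: len(ids.split(';')) if ids else 0)
    return {
        'none': int((counts == 0).sum()),
        'one': int((counts == 1).sum()),
        'many': int((counts > 1).sum()),
    }

# placeholder data (assumption): same column layout as the enhanced lists
sample = pd.DataFrame({
    'title': ['Poetics', 'Of the Three Unities', 'Rhetoric', 'On the Sublime'],
    'ids': ['example.001;example.002', '', 'example.003', None],
})
summary = summarise_ids(sample)
print(summary)
```

Rows counted under `none` are candidates for a looser query (for example, author only), while rows under `many` still need a person to choose the right edition.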


---
Stéfan Sinclair, Creative Commons (CC-BY)