This notebook lets the user select an XML metadata collection, zip it up, and send it to a service that runs a transform on the records and returns a simple CSV made up of seven data points. The data included is the Collection name, Dialect name, Record name, Concept name, Content, XPath location, and the Dialect Definition for the concept.

The notebook utilizes Bash and Python with the default packages contained in the Mac build of Anaconda with Python 3.6. Saxon, Java, and XSLT form the evaluation service in a virtual machine on an NCEAS server. 

This CSV contains a row for each concept that is found, so a location that fulfills multiple concepts appears more than once. The concepts Keyword and Place Keyword are a good example. Every Place Keyword is also a Keyword, so the row repeats with a different Concept name. The CSV also contains a row for each undefined node that contains text, marking these rows with Unknown in the Concept column.
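
To make the row structure concrete, here is a minimal sketch that uses only the standard library. The sample rows, the header names (`Record`, `Concept`, `Content`, `XPath`), and the XPaths are invented for illustration and may not match what the service actually returns.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; header names and XPaths are assumptions,
# not the evaluator service's actual output.
SAMPLE = """Record,Concept,Content,XPath
rec1.xml,Keyword,Alaska,/eml/dataset/keywordSet/keyword
rec1.xml,Place Keyword,Alaska,/eml/dataset/keywordSet/keyword
rec1.xml,Unknown,v1.2,/eml/dataset/otherNode
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Group concept names by (record, location) to find locations
# that fulfill more than one concept.
concepts_by_location = defaultdict(list)
for row in rows:
    concepts_by_location[(row['Record'], row['XPath'])].append(row['Concept'])

shared = {loc: names for loc, names in concepts_by_location.items()
          if len(names) > 1}

# Rows for undefined nodes are flagged in the Concept column.
unknown = [row for row in rows if row['Concept'] == 'Unknown']
```

Here `shared` holds the one location that is both a Keyword and a Place Keyword, and `unknown` holds the single undefined node.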

This data can be used in a variety of analyses including RAD and QuickE as well as Concept Verticals. It can also be used to teach the system dialect definitions for concepts that are currently unknown by exposing all of the content at undefined nodes. 

## First, import the libraries we need for our metadata wrangle

In [2]:
import pandas as pd
import os
from os import walk
import shutil
from ipywidgets import interactive
import ipywidgets as widgets
import requests
from contextlib import closing
import csv
import io

### Now let's select some metadata. 

If you have clean metadata on your computer that you want to add, you can first load it into your local copy of the repository with the [Add Metadata](AddMetadata.ipynb) notebook, then return to this notebook. Otherwise, follow along and use the sample metadata that the following steps help you select.

Create a list of subdirectories in the collection directory of MILE2 to select metadata for evaluation

In [3]:
Organizations = []
for (dirpath, dirnames, filenames) in walk('../collection/'):
    Organizations.extend(dirnames)
    break  
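
The same walk-and-break pattern is used again below for collections and dialects. As a sketch, it could be wrapped in one helper; `list_subdirectories` is a hypothetical name and is not part of the notebook.

```python
import os


def list_subdirectories(path):
    """Return the names of the directories directly under `path`, sorted."""
    # next() takes only the first (top-level) tuple from os.walk, which is
    # what the loop-and-break pattern does. The default of empty lists
    # covers a path that does not exist, because os.walk yields nothing then.
    _, dirnames, _ = next(os.walk(path), (path, [], []))
    return sorted(dirnames)
```

With this helper, the cell above would reduce to `Organizations = list_subdirectories('../collection/')`.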

Create a function to select the organization the metadata comes from

In [4]:
def OrganizationChoices(organization):
    # Store the dropdown selection in a module-level variable for later cells
    global Organization
    Organization = organization
    print("Organization of the collection is", Organization)


Create a dropdown using the Organizations list and the organization selector function. This sets the Organization variable.

In [5]:
interactive(OrganizationChoices, organization=Organizations)

Create a list of collections in the organization directory selected in the dropdown above

In [6]:
Collections = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization)):
    Collections.extend(dirnames)
    break 
Collections

['LTER_2005',
 'LTER_2006',
 'LTER_2007',
 'LTER_2008',
 'LTER_2009',
 'LTER_2010',
 'LTER_2011',
 'LTER_2012',
 'LTER_2013',
 'LTER_2014',
 'LTER_2015',
 'LTER_2016']

Create a function to select the collection the metadata comes from

In [7]:
def CollectionChoices(collection):
    # Store the dropdown selection in a module-level variable for later cells
    global Collection
    Collection = collection

Create a dropdown using the Collections list and the collection selector function. This sets the Collection variable.

In [8]:
interactive(CollectionChoices, collection=Collections)

Many organizations support multiple metadata dialects and share their collections in more than one dialect. The list of dialects is built the same way as the lists above: each subdirectory of the selected collection is one dialect the collection is shared in.

In [9]:
Dialects = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization,Collection)):
    Dialects.extend(dirnames)
    break 
dialectList=Dialects


Create a function to select the dialect you want to send to the evaluator service.

In [10]:
def dialectChoice(dialect):
    global Dialect
    Dialect=dialect
    print("Dialect of the collection is", Dialect)


Create a dropdown using the Dialects list and the dialect selector function. This sets the Dialect variable.

In [11]:
interactive(dialectChoice,dialect=dialectList)

Change to the zip directory

In [12]:
%cd ../zip

/Users/scgordon/MILE2/zip


Combine the Organization, Collection, and Dialect variables with the string 'xml' as a relative path and save the string to a variable

In [13]:
MetadataDestination=os.path.join(Organization,Collection,Dialect,'xml')
MetadataDestination

'LTERthroughTime/LTER_2008/EML/xml'

Use the path to create a directory structure in the zip directory

In [14]:
os.makedirs(MetadataDestination, exist_ok=True)

Create a path to the metadata you selected earlier and save the string to a variable, 'MetadataLocation'.

In [15]:
MetadataLocation=os.path.join('../collection/',Organization,Collection,Dialect,'xml')

MetadataLocation

'../collection/LTERthroughTime/LTER_2008/EML/xml'

Copy the metadata to the new directory structure.

In [16]:
src_files = os.listdir(MetadataLocation)
for file_name in src_files:
    full_file_name = os.path.join(MetadataLocation, file_name)
    if (os.path.isfile(full_file_name)):
        shutil.copy(full_file_name, MetadataDestination)

Make a zip file to upload to the evaluator service

In [17]:
shutil.make_archive('../upload/metadata', 'zip', os.getcwd())

'/Users/scgordon/MILE2/upload/metadata.zip'
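
Before uploading, it can help to confirm that the archive holds the files you expect. The sketch below zips a directory and lists the files inside the result; `zip_and_list` is a hypothetical helper and is not part of the notebook.

```python
import shutil
import zipfile


def zip_and_list(source_dir, archive_base):
    """Zip `source_dir` and return (archive_path, sorted file names in it)."""
    # make_archive appends '.zip' to archive_base and returns the full path
    archive_path = shutil.make_archive(archive_base, 'zip', source_dir)
    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries so only the metadata files are listed
        names = [n for n in archive.namelist() if not n.endswith('/')]
    return archive_path, sorted(names)
```

In this notebook the call would be `zip_and_list(os.getcwd(), '../upload/metadata')`. Keep the archive outside the directory being zipped so it is not added to itself.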

In [18]:
%cd ../upload

/Users/scgordon/MILE2/upload


Send the metadata to the evaluator service and read the CSV response into a dataframe. This step can take up to a minute and shows no progress indicator, but it ends with either a dataframe or an error message.

In [19]:
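url = 'http://metadig.nceas.ucsb.edu/metadata/evaluator'
# Open the zip in a context manager so the file handle is closed after the upload
with open('metadata.zip', 'rb') as zip_file:
    r = requests.post(url, files={'zipxml': zip_file})
r.raise_for_status()
# Parse the CSV text of the response into a dataframe
CollectionConceptsDF = pd.read_csv(io.StringIO(r.text))
CollectionConceptsDF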

ConnectionError: HTTPConnectionPool(host='metadig.nceas.ucsb.edu', port=80): Max retries exceeded with url: /metadata/evaluator (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x111c83e48>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
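
The traceback above is what appears when the service host cannot be reached. As a sketch, the upload could be wrapped so that a failed connection produces a short message instead. `evaluate_zip` and its `post` parameter are hypothetical. `post` stands in for `requests.post`, which lets the sketch run without a network, and the rows come back as plain dictionaries rather than a dataframe.

```python
import csv
import io


def evaluate_zip(zip_path, url, post, timeout=120):
    """Upload a zip and return (rows, error); exactly one of them is None.

    `post` is any callable with the same interface as requests.post
    (a hypothetical injection point, not part of the notebook).
    """
    try:
        with open(zip_path, 'rb') as handle:
            response = post(url, files={'zipxml': handle}, timeout=timeout)
        response.raise_for_status()
    except OSError as err:
        # The exceptions raised by requests derive from IOError/OSError,
        # as does a missing zip file, so both failure modes land here.
        return None, 'Evaluator request failed: {}'.format(err)
    # Parse the CSV text of the response into a list of row dictionaries
    rows = list(csv.DictReader(io.StringIO(response.text)))
    return rows, None
```

A call might look like `rows, error = evaluate_zip('metadata.zip', url, requests.post)`. The 120 second timeout is an assumption based on the "up to a minute" estimate above.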

Save the dataframe as a csv for further analysis

In [23]:
CollectionConceptsDF.to_csv('../data/data.csv', mode = 'w', index=False)

Clean up temporary files and directories, then switch to the data directory

In [24]:
%cd ../
shutil.rmtree('upload')
%cd zip
shutil.rmtree(Organization)
%cd ../data

/Users/scgordon/MILE2
/Users/scgordon/MILE2/zip
/Users/scgordon/MILE2/data


Copy the csv to a directory named for the organization that has the metadata in its holdings. Give it a filename matching the metadata collection and dialect

In [25]:
shutil.copy("data.csv", os.path.join(Organization,Collection+'_'+Dialect+'_'+'data.csv'))

'NASA/GHRC_CSDGM_data.csv'
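
The copy above works only if the organization directory already exists inside the data directory; otherwise `shutil.copy` raises an error. A minimal sketch that creates the directory first is shown below. `archive_csv` and its `data_root` parameter are hypothetical names and are not part of the notebook.

```python
import os
import shutil


def archive_csv(source_csv, organization, collection, dialect, data_root='.'):
    """Copy `source_csv` to <data_root>/<organization>/<collection>_<dialect>_data.csv."""
    target_dir = os.path.join(data_root, organization)
    # Create the organization directory on first use; do nothing if it exists
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(
        target_dir, '{}_{}_data.csv'.format(collection, dialect))
    # shutil.copy returns the path of the newly created file
    return shutil.copy(source_csv, target)
```

From the data directory, the call would be `archive_csv('data.csv', Organization, Collection, Dialect)`.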

Now that the data about our metadata is prepared and stored, we can look at collection analytics, cross-collection analytics, and concept verticals, and help define unknown concepts.

### Select the notebook that prepares the data for different types of analysis

* [Concept Verticals](ConceptVerticals.ipynb)
* [Quick Evaluation Cross Collection Comparisons](QuickEvaluation-CrossCollectionComparisons.ipynb)
* [Create RAD Data](CreateRADdata.ipynb)
* [Exploring Unknown Document Locations](ExploringUnknownDocumentLocations.ipynb)