This notebook lets the user select XML metadata collections and zip them for upload to a service that runs a transform on them and returns a simple CSV. The resulting data includes the Collection name, Dialect name, Record name, Concept name, Content, XPath location, and the Dialect Definition for the concept.

The notebook utilizes Bash and Python with the default packages contained in the Mac build of Anaconda with Python 3.6. 

Saxon, Java, and XSLT form the evaluation web service, which runs on an NCEAS virtual machine.

This CSV contains a row for each concept that is found, so an element that fulfills multiple concepts appears in multiple rows. A good example is the pair of concepts Keyword and Place Keyword: every Place Keyword is also a Keyword, so the row repeats with a different Concept name. The CSV also contains a row for each undefined node that contains text, marked with Unknown in the Concept column.
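
To make the row-per-concept layout concrete, here is a minimal sketch that parses a small, invented CSV excerpt. The column subset, record name, XPaths, and content values are assumptions for illustration only and are not taken from a real evaluator response.

```python
import csv
import io

# Hypothetical excerpt of the evaluator output; columns and values are invented.
sample_csv = """Record,Concept,Content,XPath
rec1.xml,Keyword,Alaska,/metadata/keywords/place
rec1.xml,Place Keyword,Alaska,/metadata/keywords/place
rec1.xml,Keyword,salmon,/metadata/keywords/theme
rec1.xml,Unknown,v1.2,/metadata/internalVersion
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Group the concepts found at each XPath: one element can satisfy several concepts.
concepts_by_xpath = {}
for row in rows:
    concepts_by_xpath.setdefault(row["XPath"], []).append(row["Concept"])

# Rows marked Unknown expose content at nodes that have no dialect definition.
unknown_content = [row["Content"] for row in rows if row["Concept"] == "Unknown"]

print(concepts_by_xpath["/metadata/keywords/place"])
print(unknown_content)
```

In this sketch the place element produces two rows, one per concept, while the undefined node produces a single Unknown row.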

This data can be used in a variety of analyses including RAD and QuickE as well as Concept Verticals. It can also be used to teach the system dialect definitions for concepts that are currently unknown by exposing all of the content at undefined nodes. 

## First we need to import all of the libraries we need for our metadata wrangle

In [51]:
import pandas as pd
pd.options.display.width = 180
import os
from os import walk
import shutil
import ipywidgets as widgets
from ipywidgets import *
import requests
import csv
import io

### Now let's select some metadata. 

If you have prepared metadata\* on your computer that you want to add, you can upload it into the local repository using the [Add Metadata](00AddMetadata.ipynb) notebook before completing the following cells in this notebook. Otherwise, follow along and use the sample metadata that the following steps help you select.

\* Prepared metadata contains a root element that has a standardized namespace and namespace prefix. Many dialects such as ISO and DIF are consistently written this way, but some dialects such as CSDGM are often written by organizations as only well-formed XML.
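
One rough way to check whether a record's root element carries a namespace is sketched below with the standard library. The helper name `root_namespace` and both sample documents are hypothetical. Note that `xml.etree.ElementTree` reports only the namespace URI, not the prefix, so this sketch checks the namespace alone.

```python
import xml.etree.ElementTree as ET


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


# Hypothetical records for illustration only.
namespaced = '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"/>'
plain = "<metadata><idinfo/></metadata>"

print(root_namespace(namespaced))
print(root_namespace(plain))
```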

Create a list of subdirectories in the collection directory of MILE2 to select metadata for evaluation.

In [52]:
Organizations = []
for (dirpath, dirnames, filenames) in walk('../collection/'):
    Organizations.extend(dirnames)
    break  
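
The `walk` plus `break` pattern reads only the top level of the directory. An alternative sketch using `os.scandir` is shown below. It is not what the notebook uses; the helper name `list_subdirectories` and the throwaway directory names are assumptions for the demonstration.

```python
import os
import tempfile


def list_subdirectories(path):
    """Return the names of the immediate subdirectories of path, sorted."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


# Demonstrate on a temporary directory tree rather than ../collection/.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "DataONE", "US_MPC"))
    os.makedirs(os.path.join(root, "ExampleOrg"))
    open(os.path.join(root, "README.txt"), "w").close()
    found = list_subdirectories(root)

print(found)
```

Only the top-level directories are returned; nested directories and plain files are skipped.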

Create a function to select the organization the metadata comes from

In [53]:
def OrganizationChoices(organization):
    # Store the dropdown selection in a module-level variable for later cells
    global Organization
    Organization=organization
    print("Organization of the collection is", Organization)


Create a dropdown using the Organizations list and the organization selector function. This sets the Organization variable.

In [54]:
interactive(OrganizationChoices, organization=Organizations)

Create a list of collections in the organization directory selected in the dropdown above

In [55]:
Collections = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization)):
    Collections.extend(dirnames)
    break 
Collections

['LTER',
 'CDL',
 'US_MPC',
 'LTER_EUROPE',
 'USANPN',
 'PISCO',
 'KUBI',
 'IARC',
 'RGD',
 'SEAD',
 'ONEShare',
 'GOA',
 'EDACGSTORE',
 'KNB',
 'TFRI',
 'ESA',
 'USGSCSAS',
 'GLEON',
 'DRYAD',
 'SANPARKS',
 'TERN',
 'EDORA',
 'NMEPSCOR',
 'ORNLDAAC',
 'IOE']

Create a function to select the collection the metadata comes from

In [56]:
def CollectionChoices(collection):
    # Store the dropdown selection in a module-level variable for later cells
    global Collection
    Collection=collection

Create a dropdown using the Collections list and the collection selector function. This sets the Collection variable.

In [57]:
interactive(CollectionChoices, collection=Collections)

Many organizations support multiple metadata dialects, and share their collections in more than one dialect. This list is created the same way the others are. It adds the different dialects the collection is shared in to a list.

In [64]:
Dialects = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization,Collection)):
    Dialects.extend(dirnames)
    break 
dialectList=Dialects


Create a function to select the dialect you want to send to the evaluator service.

In [65]:
def dialectChoice(dialect):
    global Dialect
    Dialect=dialect
    print("Dialect of the collection is", Dialect)


Create a dropdown using the Dialects list and the dialect selector function. This sets the Dialect variable.

In [66]:
interactive(dialectChoice,dialect=dialectList)

Change to the zip directory.

In [67]:
%cd ../zip

/Users/scgordon/MILE2/zip


Combine the Organization, Collection, and Dialect variables with the string 'xml' as a relative path and save the string to a variable

In [68]:
MetadataDestination=os.path.join(Organization,Collection,Dialect,'xml')
MetadataDestination

'DataONE/US_MPC/Onedcx/xml'

Use the path to create a directory structure in the zip directory

In [69]:
os.makedirs(MetadataDestination, exist_ok=True)

Create a path to the metadata you selected earlier and save the string to a variable, 'MetadataLocation'.

In [70]:
MetadataLocation=os.path.join('../collection/',Organization,Collection,Dialect,'xml')

MetadataLocation

'../collection/DataONE/US_MPC/Onedcx/xml'

Copy the metadata to the new directory structure.

In [71]:
src_files = os.listdir(MetadataLocation)
for file_name in src_files:
    full_file_name = os.path.join(MetadataLocation, file_name)
    if (os.path.isfile(full_file_name)):
        shutil.copy(full_file_name, MetadataDestination)

Make a zip file to upload to the evaluator service

In [72]:
shutil.make_archive('../upload/metadata', 'zip', os.getcwd())

'/Users/scgordon/MILE2/upload/metadata.zip'
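
The staging and zipping steps can also be done without changing the working directory, by passing `root_dir` to `shutil.make_archive`. The sketch below runs in a temporary directory. The file content is invented; the path components reuse the example values shown in the outputs above.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work:
    # Stage one hypothetical record under Organization/Collection/Dialect/xml.
    staging = os.path.join(work, "zip")
    record_dir = os.path.join(staging, "DataONE", "US_MPC", "Onedcx", "xml")
    os.makedirs(record_dir)
    with open(os.path.join(record_dir, "03382-metadata.xml"), "w") as handle:
        handle.write("<metadata/>")

    # root_dir names the directory to archive, so no chdir is needed.
    upload_dir = os.path.join(work, "upload")
    os.makedirs(upload_dir)
    archive = shutil.make_archive(
        os.path.join(upload_dir, "metadata"), "zip", root_dir=staging
    )

    # List the files stored in the archive, ignoring directory entries.
    with zipfile.ZipFile(archive) as zf:
        names = [name for name in zf.namelist() if not name.endswith("/")]

print(os.path.basename(archive))
print(names)
```

The archive keeps the relative directory structure, which is how the organization, collection, and dialect travel with each record.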

Send metadata to the Evaluator. Get the responses with csv encoding. This step can take up to a minute and doesn't track progress, but a dataframe or an error message will be returned.

Save the dataframe as a csv for further analysis. Copy the csv to a directory named for the organization that has the metadata in its holdings. Give it a filename matching the metadata collection and dialect.

Clean up temporary files and directories, then switch to the data directory.

In [73]:
%cd ../upload 
# Send metadata package, read the response into a dataframe
url = 'http://metadig.nceas.ucsb.edu/metadata/evaluator'
# Open the zip in a with block so the file handle is closed after the upload
with open('metadata.zip', 'rb') as zipped:
    r = requests.post(url, files={'zipxml': zipped}, headers={"Accept-Encoding": "gzip"})
r.raise_for_status()
EvaluatedMetadataDF = pd.read_csv(io.StringIO(r.text), quotechar='"')

# Build file paths and file names
Filedirectory=os.path.join('../data/',Organization)
# Make sure the organization's data directory exists before writing to it
os.makedirs(Filedirectory, exist_ok=True)
Filename='/'+Collection+'_'+Dialect+'_Evaluated.csv.gz'
SimplifiedFilename='/'+Collection+'_'+Dialect+'_EvaluatedSimplified.csv.gz'
FilePath=Filedirectory+Filename
SimplifiedFilePath=Filedirectory+SimplifiedFilename
# Label every row with the collection it came from
EvaluatedMetadataDF.insert(3, 'Collection', Organization+'_'+Collection+'_'+Dialect)

EvaluatedMetadataDF.to_csv(FilePath, mode = 'w', compression='gzip', index=False)

#Change directories, delete upload directory and zip. Delete copied metadata.
%cd ../
shutil.rmtree('upload')
%cd zip
shutil.rmtree(Organization)
%cd ../data

# Create a simplified XPath output
EvaluatedSimplifiedMetadataDF = EvaluatedMetadataDF.copy()
# Drop the ISO character-string wrapper element
EvaluatedSimplifiedMetadataDF['XPath']=EvaluatedSimplifiedMetadataDF['XPath'].str.replace('/gco:CharacterString', '', regex=False)
# Strip lowercase namespace prefixes such as gmd: (regex made explicit; newer pandas defaults to literal matching)
EvaluatedSimplifiedMetadataDF['XPath']=EvaluatedSimplifiedMetadataDF['XPath'].str.replace('/[a-z]+:+?', '/', regex=True)
# Remove type elements named like MD_Metadata
EvaluatedSimplifiedMetadataDF['XPath']=EvaluatedSimplifiedMetadataDF['XPath'].str.replace('/[A-Z]+_[A-Za-z]+/?', '/', regex=True)
# Collapse doubled slashes and trim any trailing slash
EvaluatedSimplifiedMetadataDF['XPath']=EvaluatedSimplifiedMetadataDF['XPath'].str.replace('//', '/', regex=False)
EvaluatedSimplifiedMetadataDF['XPath']=EvaluatedSimplifiedMetadataDF['XPath'].str.rstrip('/')
EvaluatedSimplifiedMetadataDF.to_csv(SimplifiedFilePath, mode = 'w', compression='gzip', index=False)

/Users/scgordon/MILE2/upload
/Users/scgordon/MILE2
/Users/scgordon/MILE2/zip
/Users/scgordon/MILE2/data
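
To see what the XPath simplification in the cell above does to a single value, the sketch below applies the same substitutions with the standard `re` module, assuming the two pattern steps are meant as regular expressions. The function name `simplify_xpath` and the ISO-style input path are illustrative assumptions.

```python
import re


def simplify_xpath(xpath):
    """Apply the notebook's XPath simplification steps to one string."""
    xpath = xpath.replace("/gco:CharacterString", "")      # drop the wrapper element
    xpath = re.sub(r"/[a-z]+:+?", "/", xpath)              # strip lowercase namespace prefixes
    xpath = re.sub(r"/[A-Z]+_[A-Za-z]+/?", "/", xpath)     # remove elements named like MD_Metadata
    xpath = xpath.replace("//", "/")                       # collapse doubled slashes
    return xpath.rstrip("/")                               # trim any trailing slash


# Hypothetical ISO-style path for illustration.
iso_path = (
    "/gmd:MD_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification"
    "/gmd:abstract/gco:CharacterString"
)
simplified = simplify_xpath(iso_path)
print(simplified)

# An attribute path like the ones in the QuickE output is left unchanged.
attribute_path = simplify_xpath("/metadata/dcTerms/spatial/@xsi:type")
print(attribute_path)
```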


Now that the raw evaluation data is prepared and stored, we can prepare the collection's data for recommendation analytics and cross-collection analytics.

Create a table with each record as a row of concept occurrence counts. Each concept that occurs in the collection is a column.

In [85]:
FiledirectoryRAD=os.path.join('../data/',Organization)
FilenameRAD='/'+Collection+'_'+Dialect+'_RAD.csv'
FilePathRAD=FiledirectoryRAD+FilenameRAD
group_name = EvaluatedSimplifiedMetadataDF.groupby(['Collection','Record', 'Concept'], as_index=False)
occuranceMatrix=group_name.size().unstack().reset_index()
occuranceMatrix=occuranceMatrix.fillna(0)
occuranceMatrix.columns.names = ['']
pd.options.display.float_format = '{:,.0f}'.format
occuranceMatrix.to_csv(FilePathRAD, mode = 'w', index=False)
occuranceMatrix

Unnamed: 0,Collection,Record,Abstract,Author,Author / Originator,Bounding Box,Embargo Date,Keyword,Metadata Access Constraints,Place Keyword,...,Related Resource Identifier,Resource Access Constraints,Resource Creation/Revision Date,Resource Identifier,Resource Title,Resource Type,Spatial Extent,Table of Contents,Temporal Extent,Unknown
0,DataONE_US_MPC_Onedcx,03382-metadata.xml,1,1,1,1,1,26,1,1,...,3,1,1,1,1,1,2,1,1,2
1,DataONE_US_MPC_Onedcx,03383-metadata.xml,1,1,1,1,1,21,1,1,...,3,1,1,1,1,1,2,1,1,2
2,DataONE_US_MPC_Onedcx,03384-metadata.xml,1,1,1,1,1,20,1,1,...,3,1,1,1,1,1,2,1,1,2
3,DataONE_US_MPC_Onedcx,03385-metadata.xml,1,1,1,1,1,18,1,1,...,3,1,1,1,1,1,2,1,1,2
4,DataONE_US_MPC_Onedcx,03386-metadata.xml,1,1,1,1,1,23,1,1,...,3,1,1,1,1,1,2,1,1,2
5,DataONE_US_MPC_Onedcx,03387-metadata.xml,1,1,1,1,1,24,1,1,...,3,1,1,1,1,1,2,1,1,2
6,DataONE_US_MPC_Onedcx,03388-metadata.xml,1,1,1,1,1,20,1,1,...,3,1,1,1,1,1,2,1,1,2
7,DataONE_US_MPC_Onedcx,03389-metadata.xml,1,1,1,1,1,16,1,1,...,3,1,1,1,1,1,2,1,1,2
8,DataONE_US_MPC_Onedcx,03390-metadata.xml,1,1,1,1,1,18,1,1,...,3,1,1,1,1,1,2,1,1,2
9,DataONE_US_MPC_Onedcx,03391-metadata.xml,1,1,1,1,1,18,1,1,...,3,1,1,1,1,1,2,1,1,2
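
The record-by-concept pivot can be tried on a tiny, invented dataframe. This is a sketch under assumptions: the collection name, record names, and rows are hypothetical, and `fill_value=0` is passed to `unstack` instead of calling `fillna(0)` afterwards, which keeps the counts as integers.

```python
import pandas as pd

# Hypothetical evaluator rows; record names and concepts are invented.
evaluated = pd.DataFrame({
    "Collection": ["Demo"] * 5,
    "Record": ["a.xml", "a.xml", "a.xml", "b.xml", "b.xml"],
    "Concept": ["Keyword", "Keyword", "Abstract", "Keyword", "Unknown"],
})

# Count rows per (Collection, Record, Concept), then pivot Concept into columns.
matrix = (
    evaluated.groupby(["Collection", "Record", "Concept"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)
matrix.columns.name = None

print(matrix)
```

Each row is one record, each concept is one column, and a zero means the concept never occurs in that record.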


Create a QuickE table with each XPath as a row and each record as a column of occurrence counts.

In [89]:
FiledirectoryQuickE=os.path.join('../data/',Organization)
FilenameQuickE='/'+Collection+'_'+Dialect+'_QuickE.csv'
FilePathQuickE=FiledirectoryQuickE+FilenameQuickE
group_name = EvaluatedSimplifiedMetadataDF.groupby(['Collection','XPath', 'Record'], as_index=False)
QuickEdf=group_name.size().unstack().reset_index()
QuickEdf=QuickEdf.fillna(0)
pd.options.display.float_format = '{:,.0f}'.format
QuickEdf.to_csv(FilePathQuickE, mode = 'w', index=False)
QuickEdf

Record,Collection,XPath,03382-metadata.xml,03383-metadata.xml,03384-metadata.xml,03385-metadata.xml,03386-metadata.xml,03387-metadata.xml,03388-metadata.xml,03389-metadata.xml,...,03622-metadata.xml,03623-metadata.xml,03624-metadata.xml,03625-metadata.xml,03626-metadata.xml,03627-metadata.xml,03628-metadata.xml,03629-metadata.xml,03630-metadata.xml,03631-metadata.xml
0,DataONE_US_MPC_Onedcx,/metadata/@xsi:schemaLocation,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,DataONE_US_MPC_Onedcx,/metadata/dcTerms/abstract,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,DataONE_US_MPC_Onedcx,/metadata/dcTerms/accessRights,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
3,DataONE_US_MPC_Onedcx,/metadata/dcTerms/available,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
4,DataONE_US_MPC_Onedcx,/metadata/dcTerms/dateSubmitted,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
5,DataONE_US_MPC_Onedcx,/metadata/dcTerms/hasPart,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
6,DataONE_US_MPC_Onedcx,/metadata/dcTerms/modified,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
7,DataONE_US_MPC_Onedcx,/metadata/dcTerms/references,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
8,DataONE_US_MPC_Onedcx,/metadata/dcTerms/spatial,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,4
9,DataONE_US_MPC_Onedcx,/metadata/dcTerms/spatial/@xsi:type,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


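Summarize each concept across the collection: its total occurrences, the number of records it occurs in, its average occurrences per record, and the percentage of records that contain it.

In [1]: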
FiledirectoryOccurance=os.path.join('../data/',Organization)
FilenameOccurance='/'+Collection+'_'+Dialect+'_Occurance.csv'
FilePathOccurance=FiledirectoryOccurance+FilenameOccurance
occuranceSum=occuranceMatrix.sum()
occuranceCount=occuranceMatrix[occuranceMatrix!=0].count()
CollectionName=FilenameOccurance.partition("/")[2].partition("_Occurance.csv")[0]
result = pd.concat([occuranceSum, occuranceCount], axis=1).reset_index()
result.insert(1, 'Collection', CollectionName)
result.insert(4, 'CollectionOccurance%', CollectionName)
result.insert(4, 'AverageOccurancePerRecord', CollectionName)
result.columns = ['Concept', 'Collection', 'ConceptCount', 'RecordCount', 'AverageOccurancePerRecord', 'CollectionOccurance%' ]
NumberOfRecords = result.at[0, 'ConceptCount'].count('.xml')
result['CollectionOccurance%'] = result['RecordCount']/NumberOfRecords
result['CollectionOccurance%'] = pd.Series(["{0:.2f}%".format(val * 100) for val in result['CollectionOccurance%']], index = result.index)
result.at[0, 'ConceptCount'] = NumberOfRecords
result.at[0, 'Concept'] = 'Number of Records'
result['AverageOccurancePerRecord'] = result['ConceptCount']/NumberOfRecords
result['AverageOccurancePerRecord'] = result['AverageOccurancePerRecord'].astype(float)
result[["ConceptCount","RecordCount"]] = result[["ConceptCount","RecordCount"]].astype(int)
result['AverageOccurancePerRecord'] = pd.Series(["{0:.2f}".format(val) for val in result['AverageOccurancePerRecord']], index = result.index)
result.to_csv(FilePathOccurance, mode = 'w', index=False)
result

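
The four summary columns can be checked by hand on a small example. The plain-Python sketch below uses invented counts and a hypothetical helper named `summarize`; the column names follow the notebook's spelling, and the formulas are the ones the cell above applies (total count, count of records with a non-zero value, and both divided by the number of records).

```python
# Hypothetical occurrence counts: one dict per record, keyed by concept.
occurrences = {
    "a.xml": {"Abstract": 1, "Keyword": 3, "Unknown": 0},
    "b.xml": {"Abstract": 1, "Keyword": 1, "Unknown": 2},
    "c.xml": {"Abstract": 0, "Keyword": 4, "Unknown": 0},
    "d.xml": {"Abstract": 1, "Keyword": 0, "Unknown": 0},
}
number_of_records = len(occurrences)


def summarize(concept):
    """Return the summary row for one concept across all records."""
    counts = [record[concept] for record in occurrences.values()]
    concept_count = sum(counts)                            # total occurrences
    record_count = sum(1 for c in counts if c != 0)        # records containing the concept
    return {
        "ConceptCount": concept_count,
        "RecordCount": record_count,
        "AverageOccurancePerRecord": "{0:.2f}".format(concept_count / number_of_records),
        "CollectionOccurance%": "{0:.2f}%".format(100 * record_count / number_of_records),
    }


keyword_summary = summarize("Keyword")
unknown_summary = summarize("Unknown")
print(keyword_summary)
print(unknown_summary)
```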

### Select the notebook that prepares the data for different types of analysis

* [Create RAD Data](02RADdf.ipynb)
* [Cross Collection Comparisons](03CrossCollectionComparisons.ipynb)
* [Concept Content Consistency](04ConceptVerticals.ipynb)
* [Exploring Unknown Concepts](05ExploringUnknownConcepts.ipynb)