This notebook lets the user select XML collections and zip them up to send to a service that runs a transform on them and returns a simple CSV made up of seven data points: the Collection name, Dialect name, Record name, Concept name, Content, XPath location, and the Dialect Definition for the concept.

The notebook utilizes Bash and Python with the default packages contained in the Mac build of Anaconda with Python 3.6. 

The evaluation web service is built with Saxon, Java, and XSLT and runs on an NCEAS virtual machine.

This CSV contains a row for each concept that is found, so an element that fulfills multiple concepts appears in multiple rows. The concepts Keyword and Place Keyword are a good example: every Place Keyword is also a Keyword, so the row repeats with a different Concept name. The CSV also contains a row for each undefined node that contains text, and marks these rows with Unknown in the Concept column.
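
To make the row layout concrete, here is a minimal sketch that parses a few hand-written rows shaped like the evaluator output. The sample values, paths, and record name are invented for illustration; only the column names Concept, Content, Record, and XPath mirror the dataframe shown later in this notebook.

```python
import csv
import io

# Hypothetical rows in the shape of the evaluator CSV (all values are made up).
sample_csv = """Concept,Content,Record,XPath
Keyword,Pacific Ocean,record1.xml,/root/keywords/place
Place Keyword,Pacific Ocean,record1.xml,/root/keywords/place
Unknown,some undefined text,record1.xml,/root/extra/note
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# The same node appears once for each concept it fulfills.
concepts_for_place_node = [
    row["Concept"] for row in rows if row["XPath"] == "/root/keywords/place"
]

# Text found at undefined nodes is labeled Unknown in the Concept column.
unknown_rows = [row for row in rows if row["Concept"] == "Unknown"]

print(concepts_for_place_node)
print(len(unknown_rows))
```

Filtering on the Concept column in this way is also how the Unknown rows could be pulled out for closer inspection.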

This data can be used in a variety of analyses including RAD and QuickE as well as Concept Verticals. It can also be used to teach the system dialect definitions for concepts that are currently unknown by exposing all of the content at undefined nodes. 

## First, import the libraries needed for the metadata wrangle

In [60]:
import pandas as pd
import os
from os import walk
import shutil
from ipywidgets import *
import ipywidgets as widgets
import requests
import csv
import io

### Now let's select some metadata. 

If you have prepared metadata\* on your computer that you want to add, you can load it into your local copy of the repository with the [Add Metadata](00AddMetadata.ipynb) notebook before running the following cells. Otherwise, follow along and use the sample metadata that the following steps help you select.

\* Prepared metadata contains a root element that has a standardized namespace and namespace prefix. Many dialects such as ISO and DIF are consistently written this way, but some dialects such as CSDGM are often written by organizations as only well-formed XML.
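
One way to check whether a record is prepared in this sense is to look at its root element. The sketch below uses the standard library's `xml.etree.ElementTree`, which reports a namespaced tag as `{namespace-uri}localname`. The helper name `root_namespace` and both sample documents are hypothetical, and the check only covers the namespace, not the prefix, because ElementTree does not keep the prefix on the parsed element.

```python
import xml.etree.ElementTree as ET

def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None

# Hypothetical documents: one with a namespaced root, one that is only well-formed.
namespaced = '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"/>'
plain = "<metadata><idinfo/></metadata>"

print(root_namespace(namespaced))
print(root_namespace(plain))
```

A record for which this returns `None` would likely need a namespace added to its root before it fits the definition above.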

Create a list of subdirectories in the collection directory of MILE2 to select metadata for evaluation

In [61]:
Organizations = []
for (dirpath, dirnames, filenames) in walk('../collection/'):
    Organizations.extend(dirnames)
    break  
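
The same walk-and-break pattern is repeated below for collections and dialects. As a sketch, it could be wrapped in a small helper. `list_subdirectories` is a hypothetical name that is not part of this notebook, the directory names are made up, and the demonstration runs against a temporary directory rather than the real `../collection/` tree.

```python
import os
import tempfile

def list_subdirectories(path):
    """Return the names of the immediate subdirectories of path, sorted."""
    # next() takes only the first (top-level) tuple from os.walk, which is
    # what the loop-and-break idiom does. The default tuple covers a missing
    # path, for which os.walk yields nothing.
    _, dirnames, _ = next(os.walk(path), (path, [], []))
    return sorted(dirnames)

with tempfile.TemporaryDirectory() as base:
    # Nested directories are created to show that only the top level is listed.
    for name in ("NASA", "NOAA"):
        os.makedirs(os.path.join(base, name, "nested"))
    found = list_subdirectories(base)

print(found)
```

Sorting is an addition here; it keeps the dropdown order stable across file systems, whereas the cell above keeps whatever order the operating system returns.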

Create a function to select the organization the metadata comes from

In [62]:
def OrganizationChoices(organization):
    # Store the dropdown selection in a global so later cells can use it
    global Organization
    Organization=organization
    print("Organization of the collection is", Organization)


Create a dropdown using the Organizations list and the organization selector function. This sets the Organization variable.

In [63]:
interactive(OrganizationChoices, organization=Organizations)

Create a list of collections in the organization directory selected in the dropdown above

In [64]:
Collections = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization)):
    Collections.extend(dirnames)
    break 
Collections

['GES_DISC', 'GHRC', 'LARC', 'NSIDC', 'PODAAC']

Create a function to select the collection the metadata comes from

In [65]:
def CollectionChoices(collection):
    # Store the dropdown selection in a global so later cells can use it
    global Collection
    Collection=collection

Create a dropdown using the Collections list and the collection selector function. This sets the Collection variable.

In [66]:
interactive(CollectionChoices, collection=Collections)

Many organizations support multiple metadata dialects, and share their collections in more than one dialect. This list is created the same way the others are. It adds the different dialects the collection is shared in to a list.

In [67]:
Dialects = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization,Collection)):
    Dialects.extend(dirnames)
    break 
dialectList=Dialects


Create a function to select the dialect you want to send to the evaluator service.

In [68]:
def dialectChoice(dialect):
    global Dialect
    Dialect=dialect
    print("Dialect of the collection is", Dialect)


Create a dropdown using the Dialects list and the dialect selector function. This sets the Dialect variable.

In [69]:
interactive(dialectChoice,dialect=dialectList)

Change to the zip directory.

In [70]:
%cd ../zip

/Users/scgordon/MILE2/zip


Combine the Organization, Collection, and Dialect variables with the string 'xml' into a relative path and save it to a variable, 'MetadataDestination'.

In [71]:
MetadataDestination=os.path.join(Organization,Collection,Dialect,'xml')
MetadataDestination

'NASA/PODAAC/ISO/xml'

Use the path to create a directory structure in the zip directory

In [72]:
os.makedirs(MetadataDestination, exist_ok=True)

Create a path to the metadata you selected earlier and save the string to a variable, 'MetadataLocation'.

In [73]:
MetadataLocation=os.path.join('../collection/',Organization,Collection,Dialect,'xml')

MetadataLocation

'../collection/NASA/PODAAC/ISO/xml'

Copy the metadata to the new directory structure.

In [74]:
src_files = os.listdir(MetadataLocation)
for file_name in src_files:
    full_file_name = os.path.join(MetadataLocation, file_name)
    if (os.path.isfile(full_file_name)):
        shutil.copy(full_file_name, MetadataDestination)

Make a zip file to upload to the evaluator service

In [75]:
shutil.make_archive('../upload/metadata', 'zip', os.getcwd())

'/Users/scgordon/MILE2/upload/metadata.zip'
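
Because `shutil.make_archive` zips everything under the directory it is given, the `Organization/Collection/Dialect/xml` path built earlier ends up inside the archive. The following is a minimal sketch of that behavior, using a temporary directory and a made-up record name in place of the real `zip` and `upload` directories.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work:
    zip_root = os.path.join(work, "zip")
    upload_dir = os.path.join(work, "upload")
    record_dir = os.path.join(zip_root, "NASA", "PODAAC", "ISO", "xml")
    os.makedirs(record_dir)
    os.makedirs(upload_dir)

    # Hypothetical record file standing in for the copied metadata.
    with open(os.path.join(record_dir, "example.xml"), "w") as handle:
        handle.write("<record/>")

    # base_name has no extension; make_archive appends ".zip" itself.
    archive = shutil.make_archive(os.path.join(upload_dir, "metadata"), "zip", zip_root)

    # List the files in the archive, skipping directory entries.
    with zipfile.ZipFile(archive) as zipped:
        members = [name for name in zipped.namelist() if not name.endswith("/")]

print(members)
```

Listing the archive members like this is a quick way to confirm what will be uploaded before calling the service.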

In [76]:
%cd ../upload

/Users/scgordon/MILE2/upload


Send the metadata to the Evaluator and read the response, which is returned as CSV text. This step can take up to a minute and does not report progress, but it ends with either a dataframe or an error message.

In [77]:
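url = 'http://metadig.nceas.ucsb.edu/metadata/evaluator'
# Open the archive in a with block so the file handle is closed after the upload
with open('metadata.zip', 'rb') as zipped_metadata:
    r = requests.post(url, files={'zipxml': zipped_metadata}, headers={"Accept-Encoding": "gzip"})
# Stop here with an error if the service did not return a successful response
r.raise_for_status()
# Parse the CSV text of the response into a dataframe
EvaluatedMetadataDF = pd.read_csv(io.StringIO(r.text), quotechar='"')
EvaluatedMetadataDF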

Unnamed: 0,Concept,Content,Record,XPath
0,Abstract,The IPRC/SOEST Aquarius OI-SSS v4 product is a...,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
1,Address,"NASA Global Change Master Directory, Goddard S...",C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
2,Address,Unknown,C1242104955.xml,/gmi:MI_Metadata/gmd:distributionInfo/gmd:MD_D...
3,Bounding Box,-180.0 180.0 -90 90,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
4,Cited Resource Identifier,AQUARIUS_L4_OISSS_IPRC_7DAY_V4,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
5,Cited Resource Title,AQUARIUS_L4_OISSS_IPRC_7DAY_V4 > IPRC/SOEST Aq...,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
6,Cited Resource Title,NASA/GCMD Science Keywords,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
7,Cited Resource Title,NASA/Global Change Master Directory (GCMD) Loc...,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
8,Cited Resource Title,NASA/Global Change Master Directory (GCMD) Dat...,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
9,Cited Resource Title,NASA/GCMD Platform Keywords,C1242104955.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...


Save the dataframe as a CSV for further analysis. Write the CSV to a directory named for the organization that has the metadata in its holdings, and give it a filename matching the metadata collection and dialect.

Clean up temporary files and directories, then switch to the data directory.

In [78]:
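Filedirectory=os.path.join('../data/',Organization)
# Make sure the organization's data directory exists before writing to it
os.makedirs(Filedirectory, exist_ok=True)
Filename='/'+Collection+'_'+Dialect+'_Evaluated.csv'
FilePath=Filedirectory+Filename
EvaluatedMetadataDF.to_csv(FilePath, mode = 'w', index=False)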
%cd ../
shutil.rmtree('upload')
%cd zip
shutil.rmtree(Organization)
%cd ../data

/Users/scgordon/MILE2
/Users/scgordon/MILE2/zip
/Users/scgordon/MILE2/data


Now that the raw evaluation data is prepared and stored, we can prepare the collection's data for recommendation analytics and cross-collection analytics.

Create a table of concept occurrence counts with one row per record. Each concept that occurs in the collection is a column.

In [92]:
FiledirectoryRAD=os.path.join('../data/',Organization)
FilenameRAD='/'+Collection+'_'+Dialect+'_RAD.csv'
FilePathRAD=FiledirectoryRAD+FilenameRAD
# Count how often each concept occurs in each record
group_name = EvaluatedMetadataDF.groupby(['Record', 'Concept'])
# Pivot the concepts into columns; a concept missing from a record becomes 0
occuranceMatrix=group_name.size().unstack().reset_index()
occuranceMatrix=occuranceMatrix.fillna(0)
pd.options.display.float_format = '{:,.0f}'.format
occuranceMatrix.to_csv(FilePathRAD, mode = 'w', index=False)
occuranceMatrix

Concept,Record,Abstract,Address,Bounding Box,Browse Description,Browse File Name,Browse Format,Browse URL,Cited Resource Identifier,Cited Resource Title,...,Standard Name Vocabulary,Start Time,Supplemental Information,Temporal Extent,Theme Keyword,URL,Unknown,VariableType,Web Page,Westernmost Longitude
0,C1242104955.xml,1,2,1,0,0,0,0,1,10,...,5,1,1,1,1,2,183,1,4,1
1,C1242476856.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,2,174,1,4,1
2,C1242560486.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,2,177,1,4,1
3,C1251066968.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,5,183,1,4,1
4,C1251117718.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,5,182,1,4,1
5,C1254640879.xml,1,2,1,0,0,0,0,1,12,...,5,1,1,1,1,4,267,1,4,1
6,C1257704009.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,4,160,1,4,1
7,C1257710580.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,4,163,1,4,1
8,C1257843632.xml,1,2,1,0,0,0,0,1,8,...,5,1,1,1,1,4,160,1,4,1
9,C1268959235.xml,1,2,1,0,0,0,0,1,12,...,5,1,1,1,1,4,212,1,4,1
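
The counting step can be seen on a toy example. This sketch builds a tiny evaluator-style table with invented records and reshapes it into one row per record and one column per concept. It calls `groupby(...).size()` with the default index so that the result is the Series that `unstack` expects.

```python
import pandas as pd

# Invented rows in the shape of the evaluator output.
evaluated = pd.DataFrame({
    "Record": ["a.xml", "a.xml", "a.xml", "b.xml"],
    "Concept": ["Keyword", "Keyword", "Abstract", "Abstract"],
})

# Count rows per (Record, Concept) pair, then pivot the concepts into columns.
counts = evaluated.groupby(["Record", "Concept"]).size()
matrix = counts.unstack(fill_value=0).reset_index()

print(matrix)
```

Passing `fill_value=0` to `unstack` is an alternative to the separate `fillna(0)` call and keeps the counts as integers.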


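Create the QuickE table, in which each XPath is a row and each record is a column of occurrence counts.

In [101]: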
FiledirectoryQuickE=os.path.join('../data/',Organization)
FilenameQuickE='/'+Collection+'_'+Dialect+'_QuickE.csv'
FilePathQuickE=FiledirectoryQuickE+FilenameQuickE
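# Count how often each XPath occurs in each record
group_name = EvaluatedMetadataDF.groupby(['XPath', 'Record'])
QuickEdf=group_name.size().unstack().reset_index()
QuickEdf=QuickEdf.fillna(0)
# Strip steps such as /gmi:MI_Metadata/ from each XPath; regex=True is required in pandas 2.0 and later
QuickEdf['XPath']=QuickEdf['XPath'].str.replace(r'/[a-z]*:[A-Z]*_[A-Za-z]*/', '/', regex=True)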

pd.options.display.float_format = '{:,.0f}'.format
QuickEdf.to_csv(FilePathQuickE, mode = 'w', index=False)
QuickEdf

Record,XPath,C1242104955.xml,C1242476856.xml,C1242560486.xml,C1251066968.xml,C1251117718.xml,C1254640879.xml,C1257704009.xml,C1257710580.xml,C1257843632.xml,...,C1285672056.xml,C1287440846.xml,C1287673549.xml,C1289641362.xml,C1289642534.xml,C1289643454.xml,C1289648148.xml,C1289648318.xml,C1289653609.xml,C1289656046.xml
0,/gmd:characterSet/gmd:MD_CharacterSetCode,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,/gmd:characterSet/@codeList,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,/gmd:characterSet/@codeListValue,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,/gmd:contact,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
4,/gmd:contact/gmd:CI_ResponsibleParty,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
5,/gmd:contact/gmd:organisationName,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
6,/gmd:contact/gmd:organisationName/gco:Characte...,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
7,/gmd:contact/gmd:role/gmd:CI_RoleCode,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
8,/gmd:contact/gmd:role/@codeList,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
9,/gmd:contact/gmd:role/@codeListValue,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
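
The regular expression in the cell above removes path steps of the form `/prefix:UPPERCASE_Name/`, such as `/gmi:MI_Metadata/`, which shortens each XPath. A small sketch with the standard `re` module shows the effect on a single path. The sample path is assembled from element names visible in the output and is only illustrative.

```python
import re

# Same pattern as the QuickE cell: a step like "/gmd:CI_ResponsibleParty/".
TYPE_STEP = re.compile(r"/[a-z]*:[A-Z]*_[A-Za-z]*/")

full_path = "/gmi:MI_Metadata/gmd:contact/gmd:CI_ResponsibleParty/gmd:organisationName"
short_path = TYPE_STEP.sub("/", full_path)

print(short_path)  # /gmd:contact/gmd:organisationName
```

A final step with no closing slash, as in `/gmd:characterSet/gmd:MD_CharacterSetCode`, does not match the pattern and is left in place, which is consistent with the first row of the output.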


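Summarize how often each concept occurs across the collection: its total count, the number of records that contain it, its average count per record, and the percentage of records that contain it.

In [96]: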
FiledirectoryOccurance=os.path.join('../data/',Organization)
FilenameOccurance='/'+Collection+'_'+Dialect+'_Occurance.csv'
FilePathOccurance=FiledirectoryOccurance+FilenameOccurance
occuranceSum=occuranceMatrix.sum()
occuranceCount=occuranceMatrix[occuranceMatrix!=0].count()
CollectionName=FilenameOccurance.partition("/")[2].partition("_Occurance.csv")[0]
result = pd.concat([occuranceSum, occuranceCount], axis=1).reset_index()
result.insert(1, 'Collection', CollectionName)
result.insert(4, 'CollectionOccurance%', CollectionName)
result.insert(4, 'AverageOccurancePerRecord', CollectionName)
result.columns = ['Concept', 'Collection', 'ConceptCount', 'RecordCount', 'AverageOccurancePerRecord', 'CollectionOccurance%' ]
# Summing the Record column concatenates the file names, so counting '.xml' gives the number of records
NumberOfRecords = result.at[0, 'ConceptCount'].count('.xml')
result['CollectionOccurance%'] = result['RecordCount']/NumberOfRecords
result.at[0, 'ConceptCount'] = NumberOfRecords
result.at[0, 'Concept'] = 'Number of Records'
result['AverageOccurancePerRecord'] = result['ConceptCount']/NumberOfRecords
result[["ConceptCount","RecordCount"]] = result[["ConceptCount","RecordCount"]].astype(int)
# Format the percentage once, after the numeric calculations are done
result['CollectionOccurance%'] = pd.Series(["{0:.2f}%".format(val * 100) for val in result['CollectionOccurance%']], index = result.index)
result.to_csv(FilePathOccurance, mode = 'w', index=False)
result

Unnamed: 0,Concept,Collection,ConceptCount,RecordCount,AverageOccurancePerRecord,CollectionOccurance%
0,Number of Records,PODAAC_ISO,101,101,1,100.00%
1,Abstract,PODAAC_ISO,101,101,1,100.00%
2,Address,PODAAC_ISO,201,101,2,100.00%
3,Bounding Box,PODAAC_ISO,103,101,1,100.00%
4,Browse Description,PODAAC_ISO,15,15,0,14.85%
5,Browse File Name,PODAAC_ISO,15,15,0,14.85%
6,Browse Format,PODAAC_ISO,15,15,0,14.85%
7,Browse URL,PODAAC_ISO,15,15,0,14.85%
8,Cited Resource Identifier,PODAAC_ISO,101,101,1,100.00%
9,Cited Resource Title,PODAAC_ISO,1054,101,10,100.00%


### Select the notebook that prepares the data for different types of analysis

* [Create RAD Data](02RADdf.ipynb)
* [Cross Collection Comparisons](03CrossCollectionComparisons.ipynb)
* [Concept Content Consistency](04ConceptVerticals.ipynb)
* [Exploring Unknown Concepts](05ExploringUnknownConcepts.ipynb)