## This notebook lets the user choose a collection of XML records, run an XSL transform with Saxon and Java through Bash, import the resulting CSV, and use Python to determine which concepts are present, which records are missing which content, which elements are undefined, and what the metadata vertical for each undefined element is.

### The intent is to create a Python framework that lets new collections of XML (would LOVE to read in JSON and FITS too) be added to a directory structure by direct upload or by entering URL(s), and then analyzes those collections. The workflow is started with dropdowns and text boxes and returns visualizations that characterize collection completeness.

### This lets us use what we know to expose what we have not yet defined conceptually for a dialect, along with what we know about the content of each such element. With these results and a room full of 'native speakers' of each dialect, it should be possible to accelerate progress toward translating 100% of the content into every public earth science and library science dialect. At that point we can provide a container for the data that all of the communities in play should be able to understand. We can even go further, creating space in the container for the notebooks, code, libraries, etc. that the provenance references, so that the environment the science was created in can be replicated to reproduce the results or open them to scrutiny. (okay maybe leave that last part out...)

### Perhaps the biggest advantage this approach has over a Docker instance is that it can be broken into objects, so that parts can be pulled out as needed or run online to provide the output a scientist needs.
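
The Saxon transform mentioned at the top is not shown in this notebook. As a rough sketch only, it could be driven from Python instead of Bash. The jar name `saxon9he.jar`, the file names, and both helper functions are hypothetical; only the `-s:`, `-xsl:` and `-o:` options are Saxon's standard command-line arguments.

```python
import subprocess

def build_saxon_command(saxon_jar, source_xml, stylesheet, output_csv):
    """Build the argument list for one Saxon transform (nothing is run)."""
    return [
        'java', '-jar', saxon_jar,
        '-s:' + source_xml,     # source XML record
        '-xsl:' + stylesheet,   # XSL stylesheet to apply
        '-o:' + output_csv,     # file the transform writes to
    ]

def run_saxon(command):
    """Run the transform; raises CalledProcessError if Java exits non-zero."""
    subprocess.check_call(command)

# Hypothetical file names, for illustration only.
command = build_saxon_command('saxon9he.jar', 'record.xml',
                              'concepts.xsl', 'concepts.csv')
print(' '.join(command))
```

Building the command as a list and passing it to `subprocess` avoids shell quoting problems with file names that contain spaces.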

### Read in a CSV that the user chooses by selecting the organization and collection.

In [116]:
from __future__ import print_function  # __future__ imports belong before all other statements
import pandas as pd
import os
from os import walk
import numpy as np
from ipykernel import kernelapp as app
from ipywidgets import *
import ipywidgets as widgets
from bokeh.charts import output_notebook, output_file, show, Bar, Scatter, Histogram, TimeSeries
from bokeh.plotting import figure
from bokeh.models import Range1d, HoverTool, ResizeTool
from bokeh.charts import defaults
from bokeh.models import Legend
defaults.width = 1200
defaults.height = 800
output_notebook()
#from ipywidgets import Button, Layout
#from glob import glob

### Query the data directory for subdirectory names and return them in a list.

In [118]:
Organizations = []
for (dirpath, dirnames, filenames) in walk('/Users/scgordon/ConceptMining/data/'):
    Organizations.extend(dirnames)
    break
Organizations    

['BCO-DMO',
 'DataOne',
 'IEDA',
 'LTERthroughTime',
 'NASA',
 'NCAR',
 'ORNL',
 'USGS']
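
The loop-and-break pattern above keeps only the first tuple that `os.walk` yields. A minimal sketch of the same idea wrapped in a reusable function follows; the name `list_subdirectories` is an assumption, not part of the notebook.

```python
import os

def list_subdirectories(root):
    """Return the sorted names of the directories directly under root."""
    # next() takes only the first (top-level) tuple from os.walk; the
    # default covers a root that does not exist, where os.walk yields nothing.
    _, dirnames, _ = next(os.walk(root), (root, [], []))
    return sorted(dirnames)
```

Sorting makes the dropdown order stable, since `os.walk` returns names in whatever order the file system reports them.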

### Create a function that stores the selected organization's path and the list of its collections, which is used to populate the collection dropdown

In [119]:
def OrganizationChoices(Organization):
    global OrganizationChoice
    global Collections
    OrganizationChoice=os.path.join('/Users/scgordon/ConceptMining/data',Organization)
    Collections=os.listdir(OrganizationChoice)
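
For comparison, here is a sketch of the same lookup that returns its results instead of setting globals. The function name `list_collections` and the filter on `.csv` files are assumptions; the filter keeps hidden files such as `.DS_Store` out of the dropdown.

```python
import os

DATA_ROOT = '/Users/scgordon/ConceptMining/data'  # path used in the notebook

def list_collections(organization, data_root=DATA_ROOT):
    """Return (organization_dir, sorted CSV file names) without using globals."""
    organization_dir = os.path.join(data_root, organization)
    csv_names = sorted(
        name for name in os.listdir(organization_dir)
        if name.lower().endswith('.csv')
    )
    return organization_dir, csv_names
```

Returning both values lets a later cell join the directory and the chosen file name without relying on hidden state.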


### Choose the organization you want to look at from a dropdown built from the list of organizations that have data.

In [120]:
interactive(OrganizationChoices, Organization=Organizations)

### Function that reads the selected csv into a dataframe

In [121]:
def CollectionChoices(Collection):
    global CollectionConceptsDF
    CollectionConceptsDF= pd.read_csv(os.path.join(OrganizationChoice, Collection))
    return CollectionConceptsDF

### Choose the CSV you want to examine

In [122]:
interactive(CollectionChoices, Collection=Collections)

Unnamed: 0,Collection,Dialect,Record,Concept,XPath,Content
0,GHRC,ISO,lohrac.xml,Abstract,/*/gmd:identificationInfo/*/gmd:abstract//*,The product is a 0.5 deg x 0.5 deg gridded com...
1,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,EARTH SCIENCE>ATMOSPHERE>ATMOSPHERIC ELECTRICI...
2,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,GLOBAL
3,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,ANNUAL
4,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,GHRC
5,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,TRMM
6,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,MICROLAB-1 > MICROLAB-1
7,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,TRMM > TRMM
8,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,OTD
9,GHRC,ISO,lohrac.xml,Additional Attributes - Descriptive Keywords,/gmi:MI_Metadata/gmd:identificationInfo/*/gmd:...,LIS


### We don't really need the collection and dialect columns once the organization and collection have been selected. The XPaths aren't needed either: they are the XPaths that were searched for, and in many cases they are relative rather than the actual location within the record itself. Let's get rid of the columns we don't need.

In [123]:
CollectionConceptsDF.drop(['Collection','Dialect', 'XPath'], axis=1, inplace=True)
CollectionConceptsDF

Unnamed: 0,Record,Concept,Content
0,lohrac.xml,Abstract,The product is a 0.5 deg x 0.5 deg gridded com...
1,lohrac.xml,Additional Attributes - Descriptive Keywords,EARTH SCIENCE>ATMOSPHERE>ATMOSPHERIC ELECTRICI...
2,lohrac.xml,Additional Attributes - Descriptive Keywords,GLOBAL
3,lohrac.xml,Additional Attributes - Descriptive Keywords,ANNUAL
4,lohrac.xml,Additional Attributes - Descriptive Keywords,GHRC
5,lohrac.xml,Additional Attributes - Descriptive Keywords,TRMM
6,lohrac.xml,Additional Attributes - Descriptive Keywords,MICROLAB-1 > MICROLAB-1
7,lohrac.xml,Additional Attributes - Descriptive Keywords,TRMM > TRMM
8,lohrac.xml,Additional Attributes - Descriptive Keywords,OTD
9,lohrac.xml,Additional Attributes - Descriptive Keywords,LIS
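
Because the drop above happens in place, rerunning that cell raises a `KeyError` once the columns are gone. A sketch of a rerunnable variant follows, assuming pandas 0.21 or later for the `columns=` keyword; the rows are made up and only mirror the column names shown above.

```python
import pandas as pd

# Made-up rows with the same column names as the table above.
concepts = pd.DataFrame({
    'Collection': ['GHRC', 'GHRC'],
    'Dialect': ['ISO', 'ISO'],
    'Record': ['lohrac.xml', 'lohrac.xml'],
    'Concept': ['Abstract', 'Keyword'],
    'XPath': ['(not shown)', '(not shown)'],
    'Content': ['example text', 'GLOBAL'],
})

unneeded = ['Collection', 'Dialect', 'XPath']

# drop() returns a new frame; errors='ignore' skips columns that are
# already missing, so the same line can run more than once.
trimmed = concepts.drop(columns=unneeded, errors='ignore')
trimmed_again = trimmed.drop(columns=unneeded, errors='ignore')
print(trimmed_again.columns.tolist())
```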


### Now that we have the data we want, what are the understood concepts that exist in the collection?

In [124]:
ConceptVerticals=CollectionConceptsDF.Concept.unique()
Verticals=ConceptVerticals.tolist()
Verticals

['Abstract',
 'Additional Attributes - Descriptive Keywords',
 'Address',
 'Ancillary Keyword',
 'Bounding Box',
 'Browse Description',
 'Browse File Name',
 'Browse URL',
 'Cited Resource Identifier',
 'Cited Resource Title',
 'City',
 'Contact Instructions',
 'Coordinate Reference System (CRS)',
 'Country',
 'Data Dates',
 'Data Quality Scope',
 'Distribution Contact',
 'Easternmost Longitude',
 'Email',
 'End Time',
 'Enumerated Domain Value',
 'Enumerated Domain Value Definition Source',
 'Format of the Online Resource',
 'Geographic Description',
 'Instrument',
 'Instrument Keyword',
 'Instrument Keyword Vocabulary',
 'Instrument Short Name',
 'Instrument Type',
 'Keyword',
 'Keyword Type',
 'Keyword Vocabulary',
 'Keyword Vocabulary Citation',
 'Metadata Contact',
 'Metadata Dates',
 'Metadata Identifier',
 'Metadata Language',
 'Metadata Modified Date',
 'Metadata Standard Citation',
 'Metadata Standard Version',
 'Northernmost Latitude',
 'Online Resource',
 'Online Resource De
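
With the concept list in hand, one possible way to find which records are missing which concepts is a record-by-concept cross-tabulation. This is a sketch on made-up rows, not the notebook's own method; in the notebook the input would be `CollectionConceptsDF`.

```python
import pandas as pd

# Made-up rows for illustration only.
rows = pd.DataFrame({
    'Record':  ['a.xml', 'a.xml', 'b.xml'],
    'Concept': ['Abstract', 'End Time', 'Abstract'],
    'Content': ['text', '2012-12-31T23:59:59.000Z', 'text'],
})

# Rows are records, columns are concepts, cells count occurrences.
counts = pd.crosstab(rows['Record'], rows['Concept'])

# For each record, list the concepts that never occur in it.
missing = {
    record: [concept for concept in counts.columns
             if counts.at[record, concept] == 0]
    for record in counts.index
}
print(missing)
```

Here `b.xml` has no `End Time` row, so it is the only record with a missing concept.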

### Create a function that returns the metadata vertical content (every matching row) for a concept

In [125]:
def ConceptVerticalTable(Concept):
    global VerticalTable
    VerticalTable = CollectionConceptsDF[CollectionConceptsDF.Concept == Concept]
    return VerticalTable

### Create a dropdown that uses the function to build a dataframe of the concept you want as a metadata vertical.

In [126]:
interact(ConceptVerticalTable, Concept=Verticals) 

Unnamed: 0,Record,Concept,Content
66,lohrac.xml,End Time,2012-12-31T23:59:59.000Z
983,lohrfc.xml,End Time,2012-12-31T23:59:59.000Z
1900,lohrmc.xml,End Time,2012-12-31T23:59:59.000Z
2817,lolrac.xml,End Time,2012-12-31T23:59:59.000Z
3734,lolracts.xml,End Time,2012-12-31T23:59:59.000Z
4651,lolradc.xml,End Time,2012-12-31T23:59:59.000Z
5568,lolrdc.xml,End Time,2012-12-31T23:59:59.000Z
6485,lolrfc.xml,End Time,2012-12-31T23:59:59.000Z
7402,lolrmts.xml,End Time,2012-12-31T23:59:59.000Z
8319,lolrts.xml,End Time,2012-12-31T23:59:59.000Z


### Let's group the unique values in the content column and count them up.

In [129]:
VerticalTable.groupby('Content').size()

Content
1991-12-31T23:59:59.000Z     3
1992-01-04T23:59:59.000Z     1
1997-11-14T23:59:59.000Z     2
1997-11-15T23:59:59.000Z     1
1997-11-30T23:59:59.000Z     1
2000-05-16T23:59:59.000Z     2
2000-05-20T23:59:59.000Z     1
2000-05-31T23:59:59.000Z     1
2008-08-08T23:59:59.000Z     2
2008-08-09T23:59:59.000Z     1
2008-08-31T23:59:59.000Z     1
2009-11-04T23:59:59.000Z     2
2009-11-07T23:59:59.000Z     1
2009-11-30T23:59:59.000Z     1
2011-12-31T23:59:59.000Z     4
2012-12-31T23:59:59.000Z    10
dtype: int64
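
`groupby('Content').size()` orders the result by value, as in the output above. If the most frequent values should come first, `value_counts()` is an alternative. A small sketch on made-up data:

```python
import pandas as pd

# Made-up vertical: three records share one end time, one record differs.
table = pd.DataFrame({
    'Content': ['2012-12-31T23:59:59.000Z'] * 3 + ['1991-12-31T23:59:59.000Z'],
})

by_value = table.groupby('Content').size()        # sorted by the value itself
by_frequency = table['Content'].value_counts()    # sorted by count, descending

print(by_value)
print(by_frequency)
```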

### Remove colons from the values so there are no Bokeh Label errors. Use Bokeh to plot a bar chart of the unique values.

In [130]:
data = VerticalTable.Content.str.replace(':','.')

p = Bar(data, 'Content', title="Vertical Value Occurrence Count", legend=False)

output_file("bar.html")

show(p)
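
`bokeh.charts` was later split out of core Bokeh, so the `Bar` call above will not import on newer installs. The sketch below shows one possible equivalent using `bokeh.plotting`; the helper names `count_labels` and `plot_counts` are assumptions. Counting is kept in plain Python so that it can be checked without Bokeh installed.

```python
from collections import Counter

def count_labels(values):
    """Replace colons with dots and count each distinct label."""
    counts = Counter(str(value).replace(':', '.') for value in values)
    labels = sorted(counts)
    return labels, [counts[label] for label in labels]

def plot_counts(values, title="Vertical Value Occurrence Count"):
    """Return a bar chart of label counts (Bokeh is imported only here)."""
    from bokeh.plotting import figure
    labels, counts = count_labels(values)
    # A list of strings as x_range makes the x axis categorical.
    p = figure(x_range=labels, title=title)
    p.vbar(x=labels, top=counts, width=0.9)
    p.xaxis.major_label_orientation = 1.0  # tilt the long timestamp labels
    return p

labels, counts = count_labels([
    '2012-12-31T23:59:59.000Z',
    '2012-12-31T23:59:59.000Z',
    '1991-12-31T23:59:59.000Z',
])
print(labels, counts)
```

In the notebook, `plot_counts(VerticalTable.Content)` followed by `show(...)` would stand in for the `Bar` call.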

INFO:bokeh.core.state:Session output file 'bar.html' already exists, will be overwritten.


