## This notebook lets the user examine all the content recorded for a specific concept across an entire collection. It uses the data.csv created by the Evaluator service in the Metadata2Data notebook.

Read in the CSV for the organization and collection that the user selects.

In [1]:
import os
import fnmatch

import pandas as pd
from ipywidgets import *  # provides interact and interactive
import ipywidgets as widgets

# bokeh.charts and ResizeTool are only available in older Bokeh releases;
# later releases moved the charts API to the separate bkcharts package.
from bokeh.charts import output_notebook, show, Bar, defaults
from bokeh.plotting import figure
from bokeh.models import Range1d, HoverTool, ResizeTool, Legend

pd.options.display.max_colwidth = 200  # show up to 200 characters per cell
defaults.width = 1800
defaults.height = 1000
output_notebook()

### Query the directory for data files

Create a list of the data files.

In [2]:
DataFiles = []
for path, subdirectories, filenames in os.walk('../data/'):
    for filename in filenames:
        if fnmatch.fnmatch(filename, '*.csv'):
            # Keep each path relative to ../data/, e.g. 'DataONE/CDL_CSDGM_data.csv'
            DataFiles.append(os.path.join(path, filename).split("../data/", 1)[-1])
DataFiles

['data.csv',
 'dataForRAD.csv',
 'BCO-DMO/GeoTraces_ISO_data.csv',
 'DataONE/CDL_CSDGM_data.csv',
 'DataONE/CLOEBIRD_EML_data.csv',
 'DataONE/Dryad_Dryad_data.csv',
 'DataONE/EDACGSTORE_CSDGM_data.csv',
 'DataONE/EDORA_Mercury_data.csv',
 'DataONE/ESA_EML_data.csv',
 'DataONE/GLEON_EML_data.csv',
 'DataONE/GOA_EML_data.csv',
 'DataONE/IARC_Onedcx_data.csv',
 'DataONE/IOE_EML_data.csv',
 'DataONE/KNB_EML_data.csv',
 'DataONE/KUBI_EML_data.csv',
 'DataONE/LTER_EML_data.csv',
 'DataONE/LTER_EUROPE_EML_data.csv',
 'DataONE/NMEPSCOR_CSDGM_data.csv',
 'DataONE/ONEShare_EML_data.csv',
 'DataONE/ORNLDAAC_Mercury_data.csv',
 'DataONE/PISCO_EML_data.csv',
 'DataONE/RGD_Mercury_data.csv',
 'DataONE/SANPARKS_EML_data.csv',
 'DataONE/SEAD_CSDGM_data.csv',
 'DataONE/TERN_EML_data.csv',
 'DataONE/TFRI_EML_data.csv',
 'DataONE/US_MPC_Onedcx_data.csv',
 'DataONE/USANPN_EML_data.csv',
 'DataONE/USGSCSAS_BDP_data.csv',
 'DataONE/USGSCSAS_CSDGM_data.csv',
 'IEDA/ECL_DCITE_data.csv',
 'IEDA/MarineGeoscienc
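
The file names in the list appear to encode the organization, collection, and dialect. The following is a minimal sketch of how those parts could be pulled out of a relative path. It assumes an `<Organization>/<Collection>_<Dialect>_data.csv` naming pattern inferred from the listing above, and `describe_data_file` is a hypothetical helper, not part of the notebook.

```python
from pathlib import PurePosixPath


def describe_data_file(relative_path):
    """Split a path such as 'DataONE/LTER_EUROPE_EML_data.csv' into its parts.

    Assumes the '<Organization>/<Collection>_<Dialect>_data.csv' pattern and
    returns None for files that do not match it (for example 'data.csv').
    """
    path = PurePosixPath(relative_path)
    suffix = "_data.csv"
    if len(path.parts) != 2 or not path.name.endswith(suffix):
        return None
    stem = path.name[: -len(suffix)]
    if "_" not in stem:
        return None
    # The dialect is taken to be the last underscore-separated token,
    # so collection names may themselves contain underscores.
    collection, dialect = stem.rsplit("_", 1)
    return {
        "organization": path.parts[0],
        "collection": collection,
        "dialect": dialect,
    }


examples = ["data.csv", "DataONE/CDL_CSDGM_data.csv", "DataONE/LTER_EUROPE_EML_data.csv"]
for name in examples:
    print(name, "->", describe_data_file(name))
```

Splitting from the right keeps multi-word collection names such as `LTER_EUROPE` intact, at the cost of assuming that dialect names never contain an underscore.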

### Choose the data you want to look at by creating a dropdown with a function that uses the list of files

Define a function that reads the selected CSV into a dataframe.

In [3]:
def DataChoices(DataFile):
    # Store the dataframe in a global so later cells can use the selection
    global CollectionConceptsDF
    CollectionConceptsDF = pd.read_csv(os.path.join('../data', DataFile))
    return CollectionConceptsDF

Choose the CSV you want to examine. The default dataframe is created from data.csv, the first entry in the list.

In [4]:
interactive(DataChoices, DataFile=DataFiles)

### Let's get rid of the columns we won't use for this analysis. We already know the collection and dialect from the file we selected, and the XPaths aren't needed either.

In [5]:
CollectionConceptsDF.drop(['Collection','Dialect', 'XPath', 'DialectDefinition'], axis=1, inplace=True)
CollectionConceptsDF

| Unnamed: 0 | Record | Concept | Content |
|---|---|---|---|
| 0 | C1282709905.xml | Metadata Identifier | gov.nasa.echo:PODAAC-GHMTB-2PN02 |
| 1 | C1282709905.xml | Metadata Language | eng |
| 2 | C1282709905.xml | Unknown | utf8 |
| 3 | C1282709905.xml | Unknown | http://www.ngdc.noaa.gov/metadata/published/xsd/schema/resources/Codelist/gmxCodelists.xml#MD_CharacterSetCode |
| 4 | C1282709905.xml | Unknown | utf8 |
| 5 | C1282709905.xml | Resource Type | series |
| 6 | C1282709905.xml | Metadata Contact | PO.DAAC pointOfContact |
| 7 | C1282709905.xml | Metadata Modified Date | 2016-10-05T18:46:42.041Z |
| 8 | C1282709905.xml | Metadata Dates | 2016-10-05T18:46:42.041Z |
| 9 | C1282709905.xml | Metadata Standard Citation | ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for imagery and gridded data |
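
Because the drop above happens in place, running the cell a second time raises a `KeyError`, and the `Unnamed: 0` column suggests the CSV was written with its index. The sketch below shows one way to handle both. The two-row sample and its column order are hypothetical stand-ins for a real Evaluator CSV.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout of an Evaluator CSV,
# including the unnamed index column written by an earlier to_csv call.
sample_csv = io.StringIO(
    ",Collection,Dialect,Record,Concept,Content,XPath,DialectDefinition\n"
    "0,PODAAC,ISO,C1.xml,Metadata Language,eng,/a/b,ISO 19115\n"
    "1,PODAAC,ISO,C1.xml,Resource Type,series,/a/c,ISO 19115\n"
)

# index_col=0 uses the unnamed first column as the index instead of
# keeping it as an 'Unnamed: 0' data column.
df = pd.read_csv(sample_csv, index_col=0)

unused = ["Collection", "Dialect", "XPath", "DialectDefinition"]

# errors="ignore" skips columns that are already gone, so repeating the
# drop is harmless; returning a new frame leaves the original untouched.
trimmed = df.drop(columns=unused, errors="ignore")
trimmed_again = trimmed.drop(columns=unused, errors="ignore")

print(list(trimmed.columns))
```

Returning a new dataframe rather than dropping in place also keeps the full table available if the XPath column is wanted later.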


### Now that we have the data we want, what are the concepts that exist in the collection?

Create a list of the unique items in the Concept column.

In [6]:
ConceptVerticals=CollectionConceptsDF.Concept.unique()
Verticals=ConceptVerticals.tolist()
Verticals

['Metadata Identifier',
 'Metadata Language',
 'Unknown',
 'Resource Type',
 'Metadata Contact',
 'Metadata Modified Date',
 'Metadata Dates',
 'Metadata Standard Citation',
 'Metadata Standard Version',
 'Spatial Representation',
 'Publication Information',
 'Abstract',
 'Purpose',
 'Resource Contact',
 'Resource Format',
 'Theme Keyword',
 'Keyword',
 'Keyword Type',
 'Keyword Vocabulary',
 'Keyword Vocabulary Citation',
 'Place Keyword',
 'Platform Keyword',
 'Platform Keyword Vocabulary',
 'Instrument Keyword',
 'Resource Language',
 'Geographic Description',
 'Spatial Extent',
 'Temporal Extent',
 'Supplemental Information',
 'Related Resource Identifier',
 'VariableType',
 'Processing Level',
 'Distribution Contact',
 'Resource Cost or Fees',
 'Distribution Format',
 'URL',
 'Related URL',
 'Data Quality Scope',
 'Resource Lineage',
 'Instrument',
 'Platform',
 'Browse File Name',
 'Browse Description',
 'Browse Format']

Create a function that returns all the content recorded for a single concept (a metadata vertical).

In [7]:
def ConceptVerticalTable(Concept):
    global VerticalTable
    VerticalTable = CollectionConceptsDF[CollectionConceptsDF.Concept == Concept]
    return VerticalTable
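
The function above hands its result to later cells through a global. For scripted use outside the notebook, the same filter can be written as a plain function that takes the dataframe as an argument. This is a minimal sketch: `concept_rows` and the sample rows are hypothetical and only illustrate the boolean-mask filter.

```python
import pandas as pd


def concept_rows(frame, concept):
    """Return the rows of `frame` whose Concept column equals `concept`."""
    return frame.loc[frame["Concept"] == concept].reset_index(drop=True)


# Hypothetical rows in the Record / Concept / Content layout used above.
sample = pd.DataFrame(
    {
        "Record": ["a.xml", "a.xml", "b.xml"],
        "Concept": ["Place Keyword", "Abstract", "Place Keyword"],
        "Content": ["Georgia", "Some abstract text", "Michigan"],
    }
)

place_keywords = concept_rows(sample, "Place Keyword")
print(place_keywords)
```

A concept that does not occur in the collection yields an empty dataframe rather than an error, which is convenient when looping over a fixed list of concepts.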

Create a dropdown that calls the function and builds a dataframe for the concept you choose.

In [8]:
interact(ConceptVerticalTable, Concept=Verticals) 

<function __main__.ConceptVerticalTable>

Let's group the unique values in the content column and count them up.

In [10]:
VerticalTable.groupby('Content').size()

Content
Carpinteria Reef              1
FOR01                         1
FOR04                         1
FOR05                         1
Forereef                      1
Georgia                       3
Goleta Bay                    1
Great Lakes                   2
Hickory Corners               2
Ipswich River Watershed       7
KBS                           2
Kellogg Biological Station    2
LTER                          2
Marsh Landing                 3
Massachusetts                 7
McMurdo Dry Valleys           2
Michigan                      2
Mohawk Reef                   1
Moorea Coral Reef             1
New England                   7
PIE LTER                      7
Parker River Watershed        3
Plum Island Ecosystems        7
Rattlesnake Creek             1
Santa Barbara                 1
Sapelo Island                 3
USA                           3
United States                 7
dtype: int64
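
`groupby('Content').size()` lists the values alphabetically. When the most frequent values matter more than the order of the labels, `value_counts()` returns the same counts sorted from most to least frequent. A small sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical content values for one concept.
table = pd.DataFrame(
    {"Content": ["Georgia", "USA", "Georgia", "Michigan", "Georgia", "USA"]}
)

# Same grouping as in the cell above: counts ordered by label.
by_name = table.groupby("Content").size()

# value_counts orders the same counts by descending frequency.
by_frequency = table["Content"].value_counts()

print(by_name)
print(by_frequency)
```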

Use Bokeh to plot a bar chart of the unique values. Remove colons from the values so there are no Bokeh Label errors. 

In [11]:
data = VerticalTable.Content.str.replace(':','.')

p = Bar(data, 'Content', title="Vertical Value Occurrence Count", legend=False)

show(p)
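
The `Bar` chart comes from `bokeh.charts`, which recent Bokeh releases no longer ship. The sketch below shows how a comparable chart might be built with `bokeh.plotting` instead. It assumes a recent Bokeh installation; `count_bar_figure` is a hypothetical helper, and the colon replacement is kept only to mirror the cell above.

```python
import pandas as pd
from bokeh.plotting import figure


def count_bar_figure(content, title="Vertical Value Occurrence Count"):
    """Build a bar chart of how often each value occurs in `content`."""
    # Count the occurrences of each value, ordered by label.
    cleaned = content.str.replace(":", ".", regex=False)
    counts = cleaned.value_counts().sort_index()
    labels = counts.index.tolist()

    # Passing a list of strings as x_range makes the x axis categorical.
    p = figure(x_range=labels, title=title)
    p.vbar(x=labels, top=counts.tolist(), width=0.8)
    p.xaxis.major_label_orientation = 1.0  # tilt long labels (radians)
    p.y_range.start = 0
    return p, counts


# Hypothetical content values; in the notebook this would be VerticalTable.Content.
sample = pd.Series(["Georgia", "USA", "Georgia", "PIE: LTER"], name="Content")
p, counts = count_bar_figure(sample)
print(counts)
```

In a notebook, calling `output_notebook()` and then `show(p)` from `bokeh.io` would display the figure inline.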