## This notebook lets the user view all the content recorded for a specific concept across an entire collection. It uses the data.csv file created by the Evaluator service in the Metadata2Data notebook.

Read in the CSV that the user selects by choosing an organization and collection.

In [1]:
import os
import fnmatch

import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, interactive

# Note: bokeh.charts was removed from later Bokeh releases (it lives on as the
# separate bkcharts package), so this cell needs an older Bokeh version.
from bokeh.charts import output_notebook, show, Bar
from bokeh.plotting import figure
from bokeh.models import Range1d, HoverTool, ResizeTool
from bokeh.charts import defaults
from bokeh.models import Legend

# Show up to 200 characters per cell so long Content values stay readable
pd.options.display.max_colwidth=200
# Default chart size in pixels
defaults.width = 1800
defaults.height = 1000
# Render Bokeh output inline in the notebook
output_notebook()

### Query the directory for data files

Create a list of the data files.

In [2]:
# Collect every CSV under ../data/, stored as a path relative to that directory
DataFiles = []
for path, subdirectories, filenames in os.walk('../data/'):
    for filename in filenames:
        if fnmatch.fnmatch(filename, '*.csv'):
            DataFiles.append(os.path.join(path, filename).split("../data/", 1)[-1])
DataFiles

['BCO-DMO/GeoTraces_ISO_data.csv',
 'DataONE/CDL_CSDGM_data.csv',
 'DataONE/Dryad_Dryad_data.csv',
 'DataONE/EDACGSTORE_CSDGM_data.csv',
 'DataONE/EDORA_Mercury_data.csv',
 'DataONE/ESA_EML_data.csv',
 'DataONE/GLEON_EML_data.csv',
 'DataONE/GOA_EML_data.csv',
 'DataONE/IARC_Onedcx_data.csv',
 'DataONE/IOE_EML_data.csv',
 'DataONE/KNB_EML_data.csv',
 'DataONE/KUBI_EML_data.csv',
 'DataONE/LTER_EML_data.csv',
 'DataONE/LTER_EUROPE_EML_data.csv',
 'DataONE/NMEPSCOR_CSDGM_data.csv',
 'DataONE/ONEShare_EML_data.csv',
 'DataONE/ORNLDAAC_Mercury_data.csv',
 'DataONE/PISCO_EML_data.csv',
 'DataONE/RGD_Mercury_data.csv',
 'DataONE/SANPARKS_EML_data.csv',
 'DataONE/SEAD_CSDGM_data.csv',
 'DataONE/TERN_EML_data.csv',
 'DataONE/TFRI_EML_data.csv',
 'DataONE/US_MPC_Onedcx_data.csv',
 'DataONE/USANPN_EML_data.csv',
 'DataONE/USGSCSAS_BDP_data.csv',
 'DataONE/USGSCSAS_CSDGM_data.csv',
 'IEDA/ECL_DCITE_data.csv',
 'IEDA/MarineGeoscienceDataSystem_ISO_data.csv',
 'LTERthroughTime/LTER_2005_EML_data.cs
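The same listing can be sketched with `pathlib`, which avoids splitting the joined string on `../data/`. This is a minimal sketch rather than the notebook's own code: the helper name `list_csv_files` is hypothetical, and it assumes all CSVs sit somewhere below a single root directory.

```python
from pathlib import Path


def list_csv_files(root):
    """Return sorted paths of all CSV files below root, relative to root."""
    root = Path(root)
    # rglob walks the tree recursively; relative_to strips the root prefix
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob('*.csv')
        if p.is_file()
    )
```

Calling `list_csv_files('../data')` should give entries in the same `Organization/Collection_Dialect_data.csv` form as the list above.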

### Choose the data you want to look at with a dropdown built from the list of files

Define a function that reads the selected CSV into a dataframe.

In [3]:
def DataChoices(DataFile):
    # Stored as a global so that later cells can use the selected dataframe
    global CollectionConceptsDF
    CollectionConceptsDF = pd.read_csv(os.path.join('../data', DataFile))
    return CollectionConceptsDF

Choose the CSV you want to examine. Until you pick another file, the dataframe is created from the first file in the list.

In [4]:
interactive(DataChoices, DataFile=DataFiles)

### Let's get rid of the columns we won't use for this analysis. The organization and dialect are already known once the organization and collection have been selected, and the XPaths aren't needed either.

In [17]:
CollectionConceptsDF.drop(['Collection','Dialect', 'XPath', 'DialectDefinition','DocumentLocation'], axis=1, inplace=True)
CollectionConceptsDF

Unnamed: 0,Record,Concept,Content
0,dataset_641044.xml,Unknown,http://www.isotc211.org/2005/gmi http://www.ngdc.noaa.gov/metadata/published/xsd/schema.xsd
1,dataset_641044.xml,Metadata Identifier,http://lod.bco-dmo.org/id/dataset/641044
2,dataset_641044.xml,Metadata Language,eng; USA
3,dataset_641044.xml,Unknown,utf8
4,dataset_641044.xml,Unknown,http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#MD_CharacterSetCode
5,dataset_641044.xml,Unknown,utf8
6,dataset_641044.xml,Resource Type,dataset
7,dataset_641044.xml,Unknown,"Highest level of data collection, from a common set of sensors or instrumentation, usually within the same research project"
8,dataset_641044.xml,Metadata Contact,Biological and Chemical Oceanography Data Management Office (BCO-DMO) Unavailable 508-289-2009 WHOI MS#36 Woods Hole MA 02543 USA info@bco-dmo.org http://www.bco-dmo.org Monday - Friday 8:00am - 5...
9,dataset_641044.xml,Metadata Modified Date,2016-03-18
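Two details of this step can be made more robust. The `Unnamed: 0` column in the output looks like a row index that was saved into the CSV, and running the `drop` cell a second time raises a `KeyError` because the columns are already gone. The sketch below addresses both. It is an illustration, not the notebook's method: `load_concepts` and `UNUSED_COLUMNS` are hypothetical names, the sample rows are made up, and it assumes the first CSV column really is a saved index.

```python
import io

import pandas as pd

# Columns the notebook drops because they are not needed for this analysis
UNUSED_COLUMNS = ['Collection', 'Dialect', 'XPath',
                  'DialectDefinition', 'DocumentLocation']


def load_concepts(csv_source):
    """Read an Evaluator CSV and keep only the columns used in this analysis."""
    # Assumption: the first, unnamed CSV column is a saved row index
    df = pd.read_csv(csv_source, index_col=0)
    # errors='ignore' skips columns that are already absent
    return df.drop(columns=UNUSED_COLUMNS, errors='ignore')


# Tiny made-up sample in a similar layout, for illustration only
sample = io.StringIO(
    ",Collection,Dialect,Record,Concept,XPath,Content\n"
    "0,GeoTraces,ISO,a.xml,Abstract,/x,Some text\n"
    "1,GeoTraces,ISO,a.xml,Keyword,/y,ocean\n"
)
concepts = load_concepts(sample)
print(concepts.columns.tolist())
```

Because the function returns a new dataframe instead of modifying one in place, it can be re-run without side effects.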


### Now that we have the data we want, what are the concepts that exist in the collection?

Create a list of the unique values in the Concept column.

In [18]:
ConceptVerticals=CollectionConceptsDF.Concept.unique()
Verticals=ConceptVerticals.tolist()
Verticals

['Unknown',
 'Metadata Identifier',
 'Metadata Language',
 'Resource Type',
 'Metadata Contact',
 'Metadata Modified Date',
 'Metadata Dates',
 'Metadata Standard Citation',
 'Metadata Standard Version',
 'Publication Information',
 'Abstract',
 'Acknowledgement',
 'Resource Status',
 'Resource Contact',
 'Resource Update Frequency',
 'Theme Keyword',
 'Keyword',
 'Keyword Type',
 'Keyword Vocabulary',
 'Keyword Vocabulary Citation',
 'Instrument Keyword',
 'Place Keyword',
 'Rights',
 'Association',
 'Resource Language',
 'Topic Category',
 'Geographic Description',
 'Spatial Extent',
 'Temporal Extent',
 'Feature Catalogue Citation',
 'VariableType',
 'Attribute Label',
 'DataType',
 'Attribute Definition',
 'Units',
 'Distribution Contact',
 'URL',
 'Product Link',
 'Data Quality Scope',
 'Resource Lineage',
 'Reprocessing Plan Note',
 'Responsibility',
 'Instrument',
 'Related Resource Identifier',
 'Cited Resource Title',
 'Platform Short Name']

Create a function that lets us call up the metadata vertical content for a concept.

In [19]:
def ConceptVerticalTable(Concept):
    global VerticalTable
    VerticalTable = CollectionConceptsDF[CollectionConceptsDF.Concept == Concept]
    return VerticalTable

Create a dropdown that uses this function to build a dataframe for the concept you want to view as a metadata vertical.

In [20]:
interact(ConceptVerticalTable, Concept=Verticals) 

Unnamed: 0,Record,Concept,Content
20,dataset_641044.xml,Resource Update Frequency,asNeeded
456,dataset_3843.xml,Resource Update Frequency,asNeeded
979,dataset_3515.xml,Resource Update Frequency,asNeeded
1190,dataset_3838.xml,Resource Update Frequency,asNeeded
1486,dataset_472814.xml,Resource Update Frequency,asNeeded
1922,dataset_3844.xml,Resource Update Frequency,asNeeded
2375,dataset_3836.xml,Resource Update Frequency,asNeeded
2783,dataset_647606.xml,Resource Update Frequency,asNeeded
3186,dataset_3831.xml,Resource Update Frequency,asNeeded
3517,dataset_648030.xml,Resource Update Frequency,asNeeded


Let's group the unique values in the content column and count them up.

In [21]:
VerticalTable.groupby('Content').size()

Content
asNeeded    117
dtype: int64
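For comparison, `value_counts()` produces the same counts as `groupby('Content').size()` but orders them from most to least frequent, which is handy when a concept has many distinct values. The dataframe below is a made-up stand-in for `VerticalTable`, used only to illustrate the two calls.

```python
import pandas as pd

# Made-up rows standing in for VerticalTable, for illustration only
vertical = pd.DataFrame({
    'Record': ['a.xml', 'b.xml', 'c.xml', 'd.xml'],
    'Concept': ['Resource Update Frequency'] * 4,
    'Content': ['asNeeded', 'asNeeded', 'daily', None],
})

# groupby(...).size() counts rows per unique value, ordered by the value itself
by_group = vertical.groupby('Content').size()

# value_counts() gives the same counts, ordered from most to least frequent
by_frequency = vertical['Content'].value_counts()

print(by_frequency)
```

With their default settings, both calls leave out rows whose Content is missing.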

Use Bokeh to plot a bar chart of the unique values. Remove colons from the values so there are no Bokeh Label errors. 

In [22]:
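# Replace colons in the values so Bokeh does not raise label errors
data = VerticalTable.Content.str.replace(':','.')

# Bar chart of how often each unique value occurs
p = Bar(data, 'Content', title="Vertical Value Occurrence Count", legend=False)

show(p)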
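Since `bokeh.charts` is not available in recent Bokeh releases, a similar chart can be sketched with `bokeh.plotting` by counting the values first and passing them to `vbar`. This is a minimal sketch under assumptions, not a drop-in replacement for the cell above: the helper names `content_counts` and `occurrence_bar` are hypothetical, it needs a pandas version whose `str.replace` accepts `regex`, and the colon replacement is kept only to mirror the original step.

```python
import pandas as pd


def content_counts(contents):
    """Count each unique value after replacing colons, as in the cell above."""
    labels = contents.str.replace(':', '.', regex=False)
    return labels.value_counts()


def occurrence_bar(counts):
    """Build a bar chart of the counts on a categorical x axis."""
    # Imported here so the counting helper works without Bokeh installed
    from bokeh.plotting import figure

    labels = counts.index.tolist()
    # Passing a list of strings as x_range creates a categorical axis
    p = figure(x_range=labels, title="Vertical Value Occurrence Count")
    p.vbar(x=labels, top=counts.tolist(), width=0.8)
    return p
```

In the notebook this could be used as `show(occurrence_bar(content_counts(VerticalTable.Content)))`, with `show` imported from `bokeh.plotting` or `bokeh.io`.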