## This notebook lets the user select XML collections, zip them up, and send them to a service that runs a transform on them and returns a simple CSV with seven data points per row: the Collection name, Dialect name, Record name, Concept name, Content, XPath location, and the Dialect Definition for the concept.

## This CSV contains a row for each concept that is found, so some locations may fulfill multiple concepts. A good example is the pair of concepts Keyword and Place Keyword: every Place Keyword is also a Keyword, so the row repeats with a different Concept name. The CSV also contains a row for each undefined node that contains text, marked with Unknown in the Concept column.

## This data can be used in a variety of analyses including RAD and QuickE as well as Concept Verticals. It can also be used to teach the system dialect definitions for concepts that are currently unknown by exposing all of the content at undefined nodes. 
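
To make the row structure concrete, here is a small sketch built on a hand-made stand-in for the CSV. The rows, XPaths, and definitions below are invented for illustration and are not real output from the service; only the column names follow the CSV shown later in this notebook. It shows one location appearing under two concepts and how the Unknown rows can be pulled out.

```python
import pandas as pd

# Hypothetical rows that mimic the CSV layout described above; the values are
# made up for illustration and are not real output from the service.
rows = [
    {"Collection": "Demo", "Dialect": "ISO", "Record": "a.xml",
     "Concept": "Keyword", "Content": "Atlantic Ocean",
     "XPath": "/root/keywords/place[1]", "DialectDefinitions": "//keywords/*"},
    {"Collection": "Demo", "Dialect": "ISO", "Record": "a.xml",
     "Concept": "Place Keyword", "Content": "Atlantic Ocean",
     "XPath": "/root/keywords/place[1]", "DialectDefinitions": "//keywords/place"},
    {"Collection": "Demo", "Dialect": "ISO", "Record": "a.xml",
     "Concept": "Unknown", "Content": "utf8",
     "XPath": "/root/characterSet", "DialectDefinitions": "/root/characterSet"},
]
df = pd.DataFrame(rows)

# A location that fulfills more than one concept appears once per concept.
concepts_per_location = df.groupby(["Record", "XPath"])["Concept"].nunique()
shared_locations = concepts_per_location[concepts_per_location > 1]

# Undefined nodes are the rows marked Unknown in the Concept column.
unknown_rows = df[df["Concept"] == "Unknown"]
```

Listing `unknown_rows` is the kind of view that could support teaching the system new dialect definitions, since it exposes the content found at undefined nodes.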

In [1]:
%%HTML
<img src="https://image.slidesharecdn.com/scgordonesipwinter2017-170125170939/95/recommendations-analysis-dashboard-1-1024.jpg">

In [2]:
import pandas as pd
import os
from os import walk
import shutil
from ipywidgets import interactive

In [3]:
Organizations = []
for (dirpath, dirnames, filenames) in walk('../collection/'):
    Organizations.extend(dirnames)
    break  
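
The walk-then-break pattern above collects only the top-level directory names, and the notebook repeats it for organizations, collections, and dialects. A minimal sketch of the same idea as a reusable helper follows; `list_subdirs` is a hypothetical name, not something defined elsewhere in this notebook, and unlike the loop above it sorts the result.

```python
import os
import tempfile

def list_subdirs(path):
    """Return the names of the immediate subdirectories of path, sorted."""
    # next(os.walk(path)) yields only the top-level (dirpath, dirnames, filenames)
    # triple, which is what the walk-then-break loop collects.
    try:
        _, dirnames, _ = next(os.walk(path))
    except StopIteration:
        # os.walk yields nothing when path does not exist.
        return []
    return sorted(dirnames)

# Small self-contained demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "OrgB", "CollectionX"))
    os.makedirs(os.path.join(root, "OrgA"))
    demo_orgs = list_subdirs(root)
    demo_missing = list_subdirs(os.path.join(root, "does-not-exist"))
```

With such a helper, each of the three listing cells could reduce to a single call, for example `list_subdirs('../collection/')`.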

In [4]:
def OrganizationChoices(organization):
    # Store the widget selection in a module-level variable for later cells.
    global Organization
    Organization = organization
    print("Organization of the collection is", Organization)


In [5]:
interactive(OrganizationChoices, organization=Organizations)

Organization of the collection is BCO-DMO


In [6]:
Collections = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization)):
    Collections.extend(dirnames)
    break 
Collections

['GeoTraces']

In [7]:
def CollectionChoices(collection):
    # Store the widget selection in a module-level variable for later cells.
    global Collection
    Collection = collection

In [8]:
interactive(CollectionChoices, collection=Collections)

In [9]:
Dialects = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization,Collection)):
    Dialects.extend(dirnames)
    break 
dialectList=Dialects


In [10]:
def dialectChoice(dialect):
    global Dialect
    Dialect=dialect
    print("Dialect of the collection is", Dialect)


In [11]:
interactive(dialectChoice,dialect=dialectList)

Dialect of the collection is ISO


In [12]:
%cd ../zip

/Users/scgordon/MILE2/zip


In [13]:
MetadataDestination=os.path.join(Organization,Collection,Dialect,'xml')
MetadataDestination

'BCO-DMO/GeoTraces/ISO/xml'

In [14]:
os.makedirs(MetadataDestination, exist_ok=True)

In [15]:
MetadataLocation=os.path.join('../collection/',Organization,Collection,Dialect,'xml')

MetadataLocation

'../collection/BCO-DMO/GeoTraces/ISO/xml'

In [16]:
src_files = os.listdir(MetadataLocation)
for file_name in src_files:
    full_file_name = os.path.join(MetadataLocation, file_name)
    if (os.path.isfile(full_file_name)):
        shutil.copy(full_file_name, MetadataDestination)
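
The copy loop above moves only regular files from the collection's `xml` folder into the staging folder and skips any subdirectories. Below is a sketch of the same step with `pathlib`, run against a temporary directory so it is safe to try; `copy_flat_files` and the file names are hypothetical and exist only for this illustration.

```python
import shutil
import tempfile
from pathlib import Path

def copy_flat_files(src_dir, dst_dir):
    """Copy the regular files directly inside src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():  # skip subdirectories, as the loop above does
            shutil.copy(entry, dst)
            copied.append(entry.name)
    return copied

# Demonstration with made-up files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "collection" / "xml"
    (src / "nested").mkdir(parents=True)
    (src / "record_1.xml").write_text("<a/>")
    (src / "record_2.xml").write_text("<b/>")
    dst = Path(tmp) / "zip" / "xml"
    copied_names = copy_flat_files(src, dst)
    copied_exists = (dst / "record_1.xml").is_file()
    nested_copied = (dst / "nested").exists()
```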

In [17]:
shutil.make_archive('../upload/metadata', 'zip', os.getcwd())

'/Users/scgordon/MILE2/upload/metadata.zip'
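
Because the archive is rooted at the current `zip` directory, the entries inside `metadata.zip` keep the Organization/Collection/Dialect/xml layout. The sketch below reproduces that behavior in a temporary directory and lists the resulting entries; the staging tree and the `record.xml` file are stand-ins created only for this example.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical staging tree mirroring Organization/Collection/Dialect/xml.
    staging = os.path.join(tmp, "zip")
    xml_dir = os.path.join(staging, "BCO-DMO", "GeoTraces", "ISO", "xml")
    os.makedirs(xml_dir)
    with open(os.path.join(xml_dir, "record.xml"), "w") as fh:
        fh.write("<record/>")

    # Create the target folder explicitly rather than relying on make_archive.
    upload_dir = os.path.join(tmp, "upload")
    os.makedirs(upload_dir)

    # base_name has no extension; make_archive appends ".zip" and returns the path.
    archive_path = shutil.make_archive(
        os.path.join(upload_dir, "metadata"), "zip", root_dir=staging
    )
    archive_name = os.path.basename(archive_path)

    # Entry names are relative to root_dir, so the directory layout is preserved.
    with zipfile.ZipFile(archive_path) as zf:
        file_entries = [n for n in zf.namelist() if not n.endswith("/")]
```

Inspecting `file_entries` before uploading is a cheap way to confirm that only the intended collection ended up in the archive.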

In [18]:
%cd ../upload

/Users/scgordon/MILE2/upload


In [19]:
%%bash
curl -o ../data/data.csv -F "zipxml=@metadata.zip" http://metadig.nceas.ucsb.edu/metadata/evaluator
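
The upload could also be driven from Python, which makes it easier to notice a failed request before `data.csv` is read. The sketch below only wraps the same curl call in `subprocess`; the URL and the `zipxml` form field are taken from the cell above, while `build_curl_command` and `run_upload` are hypothetical helper names. The added `--fail`, `--silent`, and `--show-error` flags are standard curl options, but whether the service reports errors through HTTP status codes is an assumption.

```python
import subprocess

# The URL and form field name are the ones used in the cell above.
SERVICE_URL = "http://metadig.nceas.ucsb.edu/metadata/evaluator"

def build_curl_command(zip_path, csv_path, url=SERVICE_URL):
    """Build the curl argument list for posting the zip as form field 'zipxml'."""
    return [
        "curl",
        "--fail",        # exit non-zero on HTTP errors (4xx/5xx)
        "--silent",      # suppress the progress meter
        "--show-error",  # but still print error messages
        "-o", csv_path,
        "-F", "zipxml=@" + zip_path,
        url,
    ]

def run_upload(zip_path, csv_path):
    """Run the upload; raises CalledProcessError if curl reports a failure."""
    subprocess.run(build_curl_command(zip_path, csv_path), check=True)

# Building the command does not contact the service; run_upload would.
command = build_curl_command("metadata.zip", "../data/data.csv")
```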


In [20]:
%cd ../
shutil.rmtree('upload')
%cd zip
shutil.rmtree(Organization)
%cd ../data

/Users/scgordon/MILE2
/Users/scgordon/MILE2/zip
/Users/scgordon/MILE2/data


In [21]:
CollectionConceptsDF= pd.read_csv('data.csv')
CollectionConceptsDF

Unnamed: 0,Collection,Dialect,Record,Concept,Content,XPath,DialectDefinitions
0,GeoTraces,ISO,dataset_643592.xml,Unknown,http://www.isotc211.org/2005/gmi http://www.ng...,/gmi:MI_Metadata/@xsi:schemaLocation,/gmi:MI_Metadata/@xsi:schemaLocation
1,GeoTraces,ISO,dataset_643592.xml,Metadata Identifier,http://lod.bco-dmo.org/id/dataset/643592,/gmi:MI_Metadata/gmd:fileIdentifier/gco:Charac...,/*/gmd:fileIdentifier//*
2,GeoTraces,ISO,dataset_643592.xml,Metadata Language,eng; USA,/gmi:MI_Metadata/gmd:language/gco:CharacterString,/*/gmd:language//*
3,GeoTraces,ISO,dataset_643592.xml,Unknown,utf8,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...
4,GeoTraces,ISO,dataset_643592.xml,Unknown,http://www.isotc211.org/2005/resources/Codelis...,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...
5,GeoTraces,ISO,dataset_643592.xml,Unknown,utf8,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...
6,GeoTraces,ISO,dataset_643592.xml,Resource Type,dataset,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,/*/gmd:hierarchyLevel/gmd:MD_ScopeCode
7,GeoTraces,ISO,dataset_643592.xml,Unknown,"Highest level of data collection, from a commo...",/gmi:MI_Metadata/gmd:hierarchyLevelName/gco:Ch...,/gmi:MI_Metadata/gmd:hierarchyLevelName/gco:Ch...
8,GeoTraces,ISO,dataset_643592.xml,Metadata Contact,Biological and Chemical Oceanography Data Mana...,/gmi:MI_Metadata/gmd:contact,/*/gmd:contact
9,GeoTraces,ISO,dataset_643592.xml,Metadata Modified Date,2016-04-26,/gmi:MI_Metadata/gmd:dateStamp/gco:Date,/*/gmd:dateStamp/gco:Date
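
The `Unnamed: 0` label is what pandas assigns to a column whose header field is empty, which suggests the first column of `data.csv` is a row counter written without a name. Assuming that is the case, passing `index_col=0` would use it as the index instead of a data column. The sketch below demonstrates the difference on a made-up two-row CSV, not on real service output.

```python
import io
import pandas as pd

# Made-up CSV text whose first header field is empty, as happens when a
# DataFrame is written with its index; this is not real service output.
csv_text = (
    ",Collection,Concept\n"
    "0,GeoTraces,Unknown\n"
    "1,GeoTraces,Metadata Identifier\n"
)

# Default read: the unnamed first column becomes a data column.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

default_columns = list(default_df.columns)
indexed_columns = list(indexed_df.columns)
```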


In [22]:
shutil.copy("data.csv", os.path.join(Organization,Collection+'_'+Dialect+'_'+'data.csv'))

'BCO-DMO/GeoTraces_ISO_data.csv'

### Now that our metadata is prepared and stored, we can look at collection analytics, cross-collection analytics, and concept verticals.

In [None]:
#figure out how to link other notebooks, especially nice if it's possible to pass the current dataframe
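
One low-tech way to approach the linking question above is to let other notebooks reload the CSV that was stored earlier, rather than passing the live dataframe. The sketch below rebuilds the `Organization/Collection_Dialect_data.csv` name used when the file was copied and reads it back; `collection_csv_path` is a hypothetical helper, and the demonstration writes a tiny made-up file to a temporary directory instead of touching real data.

```python
import os
import tempfile
import pandas as pd

def collection_csv_path(data_dir, organization, collection, dialect):
    """Rebuild the file name pattern used when the CSV was stored."""
    return os.path.join(data_dir, organization,
                        collection + "_" + dialect + "_data.csv")

# Demonstration with a made-up file in a throwaway data directory.
with tempfile.TemporaryDirectory() as data_dir:
    path = collection_csv_path(data_dir, "BCO-DMO", "GeoTraces", "ISO")
    os.makedirs(os.path.dirname(path))
    pd.DataFrame({"Concept": ["Unknown", "Resource Type"]}).to_csv(path, index=False)

    # Another notebook would only need this read step.
    reloaded = pd.read_csv(path)
    relative_name = os.path.relpath(path, data_dir).replace(os.sep, "/")
```

This trades convenience for independence: each analysis notebook only needs the organization, collection, and dialect names to find its input.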