# Simple notebook to explore EUDAT B2Find harvested metadata for ENES

Background:
* The ENES metadata harvested by EUDAT B2Find describes coarse-grained data collections
* These coarse-grained collections are assigned DOIs
* The ENES metadata harvested from the ESGF federation into the graph database is at file level; each file is then related to the collection levels it belongs to
* Relating ENES EUDAT B2Find metadata to ENES ESGF metadata in the graph database requires some implicit domain knowledge
* This notebook illustrates how ENES B2Find and ENES ESGF metadata relate, so that both can be integrated in the Neo4j database

Integration aspects:
* ENES ESGF metadata sometimes refers to newer versions of data entities
* ENES B2Find metadata refers to data collections, which are assigned DOIs, whereas ESGF metadata refers to data entities (individual files), which are assigned unique IDs (and soon PIDs)

### Set up CKAN client connection to the EUDAT B2Find service

In [2]:
import ckanclient
from pprint import pprint

ckan = ckanclient.CkanClient('http://b2find.eudat.eu/api/3/')

### Select ENES data subset in B2Find harvested records

In [3]:
# restrict to a few (2) results for the purpose of this notebook
q = 'tags:IPCC'
d = ckan.action('package_search', q=q, rows=2)

In [19]:
# 'title' provides the aggregation info for the data collection
# 'url' provides the DOI of the data collection
# 'notes' contains information on how to interpret the aggregation info string in 'title'

for result in d['results']:
    print result['title']
    print result['title'].split()
    print result['url']
    print result['notes']
    print "----------------------------------------------------------------"
    #for part in result:
    #    print part,":-->", result[part]

cmip5 output1 MIROC MIROC5 historical
[u'cmip5', u'output1', u'MIROC', u'MIROC5', u'historical']
http://dx.doi.org/doi:10.1594/WDCC/CMIP5.MIM5hi
'historical' is an experiment of the CMIP5 - Coupled Model Intercomparison Project Phase 5
(http://cmip-pcmdi.llnl.gov/cmip5/). CMIP5 is meant to provide a framework for coordinated
climate change experiments for the next five years and thus includes simulations for
assessment in the AR5 as well as others that extend beyond the AR5.

3.2 historical (3.2 Historical) - Version 1: Simulation of recent past (1850 to 2005). Impose changing conditions (consistent with observations).

Experiment design: http://cmip-pcmdi.llnl.gov/cmip5/docs/Taylor_CMIP5_design.pdf
List of output variables: http://cmip-pcmdi.llnl.gov/cmip5/docs/standard_output.pdf
Output: time series per variable in model grid spatial resolution in netCDF format
Earth System model and the simulation information: CIM repository

Entry name/title of data are specified according to the D
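
Since the 'url' field carries the collection DOI behind a resolver prefix, it can help to isolate the bare DOI before storing it on a graph node. The following is a minimal sketch, not part of the original notebook: `extract_doi` and `DOI_PATTERN` are illustrative names, and the regular expression is a common heuristic rather than a full DOI validator.

```python
import re

# Assumption: a DOI starts with "10.", a numeric registrant code,
# then "/" and a suffix without whitespace
DOI_PATTERN = re.compile(r'10\.\d{4,9}/\S+')

def extract_doi(url):
    """Return the bare DOI contained in a resolver URL, or None."""
    match = DOI_PATTERN.search(url)
    return match.group(0) if match else None

# 'url' value taken from the first result printed above
print(extract_doi('http://dx.doi.org/doi:10.1594/WDCC/CMIP5.MIM5hi'))
```

For the first result this yields `10.1594/WDCC/CMIP5.MIM5hi`, with the `doi:` prefix and resolver host removed.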

### Hierarchy information for B2Find ENES data

The harvested B2Find metadata (the 'notes' field) indicates how to derive the hierarchy information:
"Entry name/title of data are specified according to the Data Reference Syntax
(http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf)
as activity/product/institute/model/experiment/frequency/modeling realm/MIP table/ensemble
member/version number/variable name/CMOR filename.nc"

In [17]:
# collection pattern (neo4j nodes for pattern parts)
# <activity>/<product>/<institute>/<model>/<experiment>/<frequency>/ 
# <modeling realm>/<mip table>/<ensemble member>/
# <version number>/<variable name>/<CMORfilename.nc>

# example title:   cmip5    output1   LASG-CESS FGOALS-g2 historicalNat
# collection info: activity product   institute model     experiment

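def parse_collection_info(info_string):
    # map the whitespace-separated title parts onto the first five DRS components
    # (zip stops at the shorter sequence, so any extra title parts are ignored)
    info_parts = info_string.split()
    pattern = ['activity','product','institute','model','experiment']
    result = dict(zip(pattern,info_parts))
    return result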

for result in d['results']:
    parsed_result = parse_collection_info(result['title'])
    print parsed_result

{'institute': u'MIROC', 'product': u'output1', 'experiment': u'historical', 'model': u'MIROC5', 'activity': u'cmip5'}
{'institute': u'MPI-M', 'product': u'output1', 'experiment': u'sstClim', 'model': u'MPI-ESM-MR', 'activity': u'cmip5'}
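
The same idea can be applied in the other direction, from a file-level ESGF entry to its B2Find collection. The sketch below assumes a slash-separated path in the Data Reference Syntax order quoted above. The names `DRS_COMPONENTS`, `parse_drs_path` and `collection_title` are illustrative, and the example path is hypothetical, constructed only to match the first collection title.

```python
# Component names in DRS order, following the quoted 'notes' text
DRS_COMPONENTS = [
    'activity', 'product', 'institute', 'model', 'experiment',
    'frequency', 'modeling_realm', 'mip_table', 'ensemble_member',
    'version', 'variable', 'filename',
]

def parse_drs_path(path):
    """Split a slash-separated DRS path into named components."""
    parts = path.strip('/').split('/')
    if len(parts) != len(DRS_COMPONENTS):
        raise ValueError('expected %d DRS components, got %d'
                         % (len(DRS_COMPONENTS), len(parts)))
    return dict(zip(DRS_COMPONENTS, parts))

def collection_title(drs):
    """Rebuild the B2Find-style title from the first five components."""
    return ' '.join(drs[key] for key in DRS_COMPONENTS[:5])

# Hypothetical file-level path, for illustration only
example_path = ('cmip5/output1/MIROC/MIROC5/historical/mon/atmos/Amon/'
                'r1i1p1/v20120710/tas/'
                'tas_Amon_MIROC5_historical_r1i1p1_185001-201212.nc')
drs = parse_drs_path(example_path)
print(collection_title(drs))
```

Unlike `parse_collection_info`, this variant raises an error when the number of components is unexpected, which may be preferable when the input is not known to be well formed.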


### Relation to Neo4j ESGF graph nodes

The ESGF metadata harvesting and Neo4j graph generation are done in the script ENES-Neo4J-fill1.py.
Each component of the collection hierarchy is assigned to a node. The nodes are connected by the "belongs_to" relationship, and each node has a property "name" whose value corresponds to the values extracted from the B2Find result records (see above). Additionally, each collection has a level attribute:

experiment(6) -- belongs_to --> model(7) -- belongs_to --> institute(8) -- belongs_to --> product(9) -- belongs_to --> activity(10)

The B2Find metadata aggregates all collection levels below 6. The level 6 node therefore has to be identified in the Neo4j ESGF graph and related to the corresponding B2Find information.
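
To make the level assignment concrete, the parsed B2Find title can be arranged into the chain of nodes it should match, starting at level 6. This is only a sketch: `LEVELS` and `collection_chain` are illustrative names, and the level numbers are taken from the chain shown above.

```python
# Level numbers as listed above (experiment = 6 ... activity = 10)
LEVELS = [('experiment', 6), ('model', 7), ('institute', 8),
          ('product', 9), ('activity', 10)]

def collection_chain(parsed):
    """Return (level, component, name) tuples ordered from level 6 upward."""
    return [(level, key, parsed[key]) for key, level in LEVELS]

# Values of the first parsed B2Find result
parsed = {'activity': 'cmip5', 'product': 'output1', 'institute': 'MIROC',
          'model': 'MIROC5', 'experiment': 'historical'}
for level, key, name in collection_chain(parsed):
    print('%2d  %-10s %s' % (level, key, name))
```

The first tuple in the chain identifies the level 6 node to which the B2Find information would be attached.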


#### Cypher queries to identify the corresponding level 6 nodes in the ESGF graph structure:

In [None]:
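// $experiment, $model, ... are query parameters filled from the parsed B2Find title; n1 is the level 6 node
MATCH (n1:Collection {name: $experiment})-[:belongs_to]->(n2:Collection {name: $model})-[:belongs_to]
->(n3:Collection {name: $institute})-[:belongs_to]->(n4:Collection {name: $product})-[:belongs_to]
->(n5:Collection {name: $activity})
RETURN n1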
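
One way to connect the parsed B2Find title to such a query is to generate the query text and its parameters together. The sketch below is an assumption-laden illustration, not the method used in ENES-Neo4J-fill1.py: `CHAIN` and `build_level6_query` are hypothetical names, and the `$name` parameter syntax assumes a Neo4j version that supports it (older releases used `{name}` instead). No database connection is made here.

```python
# Components ordered from the level 6 node upward along belongs_to
CHAIN = ['experiment', 'model', 'institute', 'product', 'activity']

def build_level6_query(parsed):
    """Return (cypher_text, parameters) matching the belongs_to chain."""
    nodes = ['(n%d:Collection {name: $%s})' % (i + 1, key)
             for i, key in enumerate(CHAIN)]
    # Anonymous relationships avoid reusing one relationship variable
    query = 'MATCH ' + '-[:belongs_to]->'.join(nodes) + ' RETURN n1'
    params = dict((key, parsed[key]) for key in CHAIN)
    return query, params

# Values of the first parsed B2Find result
parsed = {'activity': 'cmip5', 'product': 'output1', 'institute': 'MIROC',
          'model': 'MIROC5', 'experiment': 'historical'}
query, params = build_level6_query(parsed)
print(query)
print(sorted(params.items()))
```

Passing the values as parameters rather than formatting them into the query string avoids quoting problems with names such as `MPI-ESM-MR`. The resulting pair can then be handed to whichever Neo4j client is in use.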