In [1]:
import numpy
import scipy
import json
try:
    from urllib import parse as urlparse  # Python 3
except ImportError:
    import urlparse  # Python 2
import requests
from IPython.display import Image

# Lightweight Data Systems Workshop May 2015

Convened by Anthony Arendt (UW/APL), Rob Fatland (MSR), Joe Futrelle (WHOI), Nancy Hess (PNNL), Bill Howe (UW), Lee Ann McCue (PNNL)

#### Note that 'Kilroy' flags a point as needing work
#### Note that 'LWDS' is the preferred abbreviation
#### Other notes here...

### Introduction and Motivation

This two-day Lightweight Data Systems (LWDS) workshop is a collective effort to think through 
a common impulse in [earth system] science: 
“I’d like to build a small data management system; so how should I proceed?”


Drawing on our collective domain interests, we are exploring 
“What considerations can we share in response?” 
with the aim of data management that does more than merely check the Data Management Plan (DMP) box in an NSF proposal.


Domain research is complex and detailed; so is this topic. 
We will consider a number of leading questions, including 
‘How far down the complexity rabbit hole can such a system go?’ 
That is, if research divides roughly into perfunctory and exploratory phases: 
when, why, and how can perfunctory analysis be formalized as a service 
that everyone can use? 
And is there a commensurate Triumph of the Commons in which this community benefits from sharing data?

### Workshop Outcomes

##### Outcome 1 The FAQ + Paper 1 + Paper 2

The FAQ is a hypothetical document.

Paper 1 is an opinion piece; assigned to Nancy.

Paper 2 is a write-up of technical components of the workshop discussion; assigned to Rob.

##### Outcome 2 The Landscape

##### Outcome 3 The Roadmap

##### Outcome 4 The Curriculum

##### Outcome 5 The Road Test Workshop



### Guidelines 

Written with the "modest research team" as the target audience: not necessarily, but quite possibly, in a scientific domain.

Carry forward a cognizance of meters/needles (e.g. ridiculous ------- sublime).

Specificity flow: From Domains > Data > Data Management > Perhaps an LWDS solution
   
Domain considerations are drivers: "Where is this going?"

Wide-world considerations: "What else could/should we be thinking about?"     

One can ask “How does this generalize? How does it extend?” where the answer should sound like
"Not much extra work was involved and we are now part of a growing community of LWDS-builders..."


##### Things to Provide

Basic explanation / lexicon
1. What is the cloud? (disk / server / apps / services / ...)
2. What is IPython? An interactive shell! (http://en.wikipedia.org/wiki/IPython)
3. What is PostgreSQL? A free database system with geospatial support
4. And so on
    
Technology evaluation
1. What role does it play? 
2. What are tradeoffs in its use? 
3. How much time to learn?  



Mike's Law: Everyone should have their own law. It is the law. Somewhere.

Bill’s Law: Provide a service to incentivize and grow community adoption

Rob’s Law: Rob doesn’t know what he is doing. 
    (Trans: Data is never used in the manner in which it is acquired. 
     (Trans: Decouple scientist from data use.)) 


### Technologies

##### Cloud solutions (Azure, Myria, SQL Share, ...)

##### Languages (Python, ...)

SYWTBALWDS

Advice: There are many programming languages available; they can be integrated, and each has attributes that 
make it more or less appealing for your data science problems. 

Let's describe programming languages and how they can be integrated.
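
As one small illustration of integration, the sketch below has Python hand the grouping and averaging to SQL through the standard-library `sqlite3` module. The table name, column names, and values are hypothetical, invented only for this example; a real LWDS might well use PostgreSQL or SQL Share instead.

```python
import sqlite3

# Hypothetical example data: (site, depth in meters, temperature in Celsius).
ROWS = [
    ("A", 5.0, 11.2),
    ("A", 50.0, 8.4),
    ("B", 5.0, 12.1),
    ("B", 50.0, 9.0),
]


def mean_temperature_by_site(rows):
    """Let SQL do the grouping and averaging; Python handles the plumbing."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE casts (site TEXT, depth_m REAL, temp_c REAL)")
        conn.executemany("INSERT INTO casts VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT site, AVG(temp_c) FROM casts GROUP BY site ORDER BY site"
        )
        return dict(cursor.fetchall())  # {site: mean temperature}
    finally:
        conn.close()


means = mean_temperature_by_site(ROWS)
for site in sorted(means):
    print("{}: {:.2f} C".format(site, means[site]))
```

The division of labor is the point: the declarative part (what to compute) lives in SQL, while the procedural part (loading, formatting, plotting) stays in Python.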

##### Web Application Frameworks (Django, ...)

##### Command shells (PowerShell, ssh, IPython, ...)

##### Analysis tools (AzureML, R, ...)


### BLD Birth Life and Death of an LWDS

We maintain that BLD is a necessary component of one's approach to LWDS. 

We say 'Organized data will outlive the parent data system.'
Perhaps it runs for 2 years:  

- Aggregate data
- Fuel papers 
- Export the database, then shut it down

(If the LWDS is still in high demand: That's a nice problem to have.)
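
The "export, then shut down" step can be sketched as follows, assuming for illustration that the LWDS sits on SQLite (the standard-library `sqlite3` module); the table, columns, and file name are hypothetical. The idea is that a plain-text SQL dump outlives the running system and is enough to rebuild the data elsewhere.

```python
import os
import sqlite3
import tempfile


def export_and_shut_down(conn, dump_path):
    """Write the whole database as portable SQL text, then close the connection."""
    with open(dump_path, "w") as handle:
        for statement in conn.iterdump():  # yields SQL statements as strings
            handle.write(statement + "\n")
    conn.close()


# Hypothetical LWDS contents, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, site TEXT, doc_um REAL)")
conn.executemany(
    "INSERT INTO samples (site, doc_um) VALUES (?, ?)",
    [("A", 71.5), ("B", 64.0)],
)
conn.commit()

dump_path = os.path.join(tempfile.mkdtemp(), "lwds_export.sql")
export_and_shut_down(conn, dump_path)

# The dump alone is enough to rebuild the data somewhere else.
restored = sqlite3.connect(":memory:")
with open(dump_path) as handle:
    restored.executescript(handle.read())
restored_rows = restored.execute(
    "SELECT site, doc_um FROM samples ORDER BY id"
).fetchall()
restored.close()
print(restored_rows)
```

Other database systems have their own export tools; the pattern (dump to an open format, verify the dump restores, then retire the service) is what carries over.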

### Data Taxonomy

- In situ
- Virtual sensors
- Remote sensing
- Lab work
- Sample processing
- Incubation studies
- Types: Mass spec flavors + LC/GC, fluorescence, color, sequencing, various isotope studies, ...
- Levels
- Volume
- For example assay versus spectral data
- Metadata


### Source Topics

- Confederation
- Adoption incentives
- Metadata
- Services
- APIs
- Databases
- Testing
- Make My Solution Replicable
- Open Source
- Data Contribution
- Pass-through
- Documentation
- Query
- Attribution
- Contribution paths
- Discovery
- Portalization
- Standards and data formats
- Data qualifier concepts: classes/types/artifacts/levels/
- “SEP Clients” where SEP = Someone Else's Problem: Pushing the task of Client-building off the to-do list.
- Visualization
- Analysis

### Stretch thinking

Suppose one had a Meta-LWDS Crawler analogous to PolarHub (Li). 
It waits to hear from and then brokers simple RESTful APIs.


- First let us build an LWDS and a special Python library called 'bee.py'
- bee.py uses JSON-RPC or equivalent protocol
- bee.py provides my LWDS with a generic API 
- Our LWDS has no web page, no portal, no HTML


- Second let us build the Crawler
- It listens for messages from a bee.py instance  
- bee.py periodically attempts to register my LWDS with the Crawler 


Upon Success: 
    
- The Crawler mirrors my LWDS API. 
- My data is available and findable. 
- I did not write a single web page or install and wrestle with Django. 
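
To make the idea concrete, here is a minimal in-process sketch of the bee/Crawler exchange using JSON-RPC 2.0 message shapes. Everything named here is an assumption for illustration: the `Bee` and `Crawler` classes, the `register` method, and the `count_samples` function are hypothetical, and no networking is shown (messages are passed as strings between two objects). The sketch handles only single requests with positional parameters.

```python
import json


def _error(request_id, code, message):
    """Build a JSON-RPC 2.0 error response string."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "error": {"code": code, "message": message}})


class Bee(object):
    """Hypothetical bee.py core: wraps LWDS functions in a generic API."""

    def __init__(self, name):
        self.name = name
        self.methods = {}

    def expose(self, func):
        """Make a plain Python function callable through the API."""
        self.methods[func.__name__] = func
        return func

    def handle(self, request_text):
        """Dispatch one JSON-RPC 2.0 request string; return the response string."""
        try:
            request = json.loads(request_text)
        except ValueError:
            return _error(None, -32700, "Parse error")
        if not isinstance(request, dict):
            return _error(None, -32600, "Invalid Request")
        func = self.methods.get(request.get("method"))
        if func is None:
            return _error(request.get("id"), -32601, "Method not found")
        result = func(*request.get("params", []))
        return json.dumps({"jsonrpc": "2.0", "id": request.get("id"),
                           "result": result})

    def registration_request(self, request_id=1):
        """The message a bee would periodically send to the Crawler."""
        return json.dumps({"jsonrpc": "2.0", "id": request_id,
                           "method": "register",
                           "params": [self.name, sorted(self.methods)]})


class Crawler(object):
    """Hypothetical Crawler: listens for registrations, records each API."""

    def __init__(self):
        self.registry = {}  # LWDS name -> list of method names it offers

    def listen(self, request_text):
        request = json.loads(request_text)
        if request.get("method") != "register":
            return _error(request.get("id"), -32601, "Method not found")
        name, method_names = request["params"]
        self.registry[name] = method_names
        return json.dumps({"jsonrpc": "2.0", "id": request["id"],
                           "result": "registered"})


bee = Bee("my-lwds")


@bee.expose
def count_samples(site):
    # Stand-in for a real database query.
    return {"A": 12, "B": 7}.get(site, 0)


crawler = Crawler()
ack = json.loads(crawler.listen(bee.registration_request()))
reply = json.loads(bee.handle(json.dumps(
    {"jsonrpc": "2.0", "id": 2, "method": "count_samples", "params": ["A"]})))
print(ack["result"], crawler.registry, reply["result"])
```

Here the Crawler only records which methods each LWDS offers; a real one would also forward calls to the registered bee, which is what "mirroring" the API would require.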


### Whiteboard Redux Section

#### South Wall (main effort)

##### Data System Taxonomy with examples

- Micro LWDS: Excel spreadsheet
- Small LWDS: SQL Share, Rob's Biogeochem Data System
- Large LWDS: Heidi's Flow Cytometry, FetchClimate
- Small DS: BCO DMO? GeoTraces?
- Large DS: Ameriflux
- National DS: NEON, NASA DACs
- Meta DS: Earthcube
    
##### An Elephant In The Room: Climate Change

##### An Elephant In The Room: The Boneyard of Unused and Under-used Data Systems

##### The Diffusion Problem

Imagine a landscape of question marks representing researchers. Down is time.
Off to the upper left at some time a person invents a solution. Example: Generic Mapping Tools (GMT). 
What is the *process* by which this solution propagates through the community? 
GMT has been prevalent for over 20 years as the premier way of putting a region of interest (ROI) on a map 
with some custom symbols indicating the details of a study site. It is highly successful... why?


##### Balancing Act

Joe's excellent point is: Don't engineer more solution than you need. 
Kilroy: Joe to provide a further exegesis on his process.
Bill's excellent point is: Let's make solutions include global access (Joe's 'ubiquity').
In the balance hangs Time and Money. 
Under "Just Enuf" on the balance we wrote "Immediate Problem". 
Under "Global Access" on the opposite side of the balance we wrote "Generic Considerations" 
and this could also be written "Community Considerations". Or cultural shift or ...


On the side of making it easy to build something we have the 'Better that 100 flowers grow' advocacy (Bill) 
that I would translate as: Don't raise the barrier to entry for some formal rationale that effectively 
discourages participation; but rather do the reverse. 



##### Awareness and Solutions

These are important considerations that our LWDS can get up to speed on through Deliverable 1.


- Metadata (which is essential for search; and that can be the enemy of use (publication))
- How APIculture is overcoming Clicki-culture
- Understand and work from Design Patterns (Joe's emphasis)
- ( Success Recipe ) that leverages existing X. Example: SQL Share published as URL to CINERGI


Kilroy there is more missing from a photo of lower down on the board


##### Taxonomy of Archetype Solutions

Some, not all! In order of mass:


- ftp file
- file system organization
- SQL Share
- Myria
- Sharepoint
- GitHub, not as a jumble-box but as a template for how to structure a product (like a paper or an LWDS) (here is a folder for your figures)
- Heidi Sosik's work on FC
- IPython Notebook
- Fill in here (Kilroy): The story of MODIS and by extension GEE


##### Civilized Institutionalized Data Systems 

- GINA
- Ice2Ocean
- LiveOcean
- NOAA: IOOS, ...
- NASA: ECHO, ECHO2, ...
- USDA, USGS, other gov ...
- NEON
- Ameriflux
- LTER
- Lamont-Doherty: R2R, Various Directories
- L-D GeoLink: Look into this further (Kilroy): Authority, tracking, notifications
- Argh what is that European mass spec one? (Kilroy; chap at the NYC mtg runs it)
- BCO DMO
- NSIDC: ACADIS, GLIMS (in relation to Anthony), Ted Scambos' stuff, sea ice extent, ...
- JGI: Genomics
- BLAST: A tool 
- DOM @ WHOI (??? Kilroy ???)
- Rob F Biogeochem Data System
- Geotraces
- Harmonized World Soil Database (PNNL knows well)
- FetchClimate + ME out of MSR Cambridge
- The ARGO story
- Kilroy there are others missing from a photo of lower down

###### Meta...

- CINERGI (Kilroy what is the experience of contributing??)
- GEOSS
- ESIP

##### It should be true...

We drew a progression from one person to two to five to 10,000 (researchers) to a fishhook (non-research community) to earth.

This is intended to represent scale of impact. 

It should be true that 2 -- 10,000 is the same solution. 

(Kilroy: "Soln criteria: Does it --> Box with little circles and big N" means what?)


#### West Wall (pyramid diagram)

##### Needs Hierarchy of Science Data Management

After Maslow, to paraphrase: As a need is met, the next level up dominates conscious functioning. 

Bill has the source slide. His point was to swap integration and sharing. Sharing is easy; integration is hard.


- Integration
- Analytics
- Query
- Sharing
- Storage


### Whiteboard Source Photos

In [2]:
splashImage = Image(url='http://robfatland.net/LWDSFigs/SplashImage.png')
# splashImage
print("This image is cool and evokes flow cytometry; but it is not doing any work at the moment so it is commented out.")

This image is cool and evokes flow cytometry; but it is not doing any work at the moment so it is commented out.


In [3]:
wb1Image = Image(url='http://robfatland.net/LWDSFigs/WB1.jpg')
wb1Image
print("This image is redux'd above and is not shown.")

This image is redux'd above and is not shown.


In [4]:
wb2Image = Image(url='http://robfatland.net/LWDSFigs/WB2.jpg')
print("This image has been reduxed above.")
# wb2Image

This image has been reduxed above.


In [5]:
wb3Image = Image(url='http://robfatland.net/LWDSFigs/WB3.jpg')
wb3Image
print("This photo has been redux'd")

This photo has been redux'd


In [6]:
wb4Image = Image(url='http://robfatland.net/LWDSFigs/WB4.jpg')
wb4Image
# Kilroy please redux this image

In [7]:
wb5Image = Image(url='http://robfatland.net/LWDSFigs/WB5.jpg')
wb5Image
# Kilroy please redux this image

In [8]:
wb6Image = Image(url='http://robfatland.net/LWDSFigs/WB6.jpg')
wb6Image
# Kilroy please redux this image

In [9]:
wb7Image = Image(url='http://robfatland.net/LWDSFigs/WB7.jpg')
wb7Image
# Kilroy please redux this image

In [10]:
# Show a smaller rendering of the image. Image() takes width/height in pixels
# for the rendered <img> tag; the browser scales the picture and the file
# itself is not resampled. width=400 is an arbitrary choice.
wb7Small = Image(url='http://robfatland.net/LWDSFigs/WB7.jpg', width=400)
wb7Small


### Deliverable 1A: LWDS Paper 1

Nancy Hess has the lead

### Deliverable 1B: LWDS Paper 2

Rob Fatland has the lead

##### Abstract
Lorem ipsum.

##### Introduction

Lorem.

##### Open Data Movement

Lorem.

### Deliverable 1C: FAQ

#### Basics

Q: I am a domain scientist with a data problem. What are the top five things I should think about? 

A: Scope, collaboration, design patterns, leverage and tradeoffs. Here is a bit more elaboration:
1. On a 3-axis diagram of data volume, complexity and perfunction-versus-exploration: Where do you live?
2. Do you collaborate with at least one other person; and are you an open-data-philosophy subscriber?
3. Please look through our design patterns and see what resonates.
4. Please look through our technologies and see what you can use that is already built and tested. 
5. How much time, money, and pre-existing technology skills do you have on hand? See our guide below.


Q: How are you going to try and talk me into embarking on a formal LWDS process; rather than
just cobbling something together and getting on with my life? 

A: The good news is that we are going to advocate for you to build the minimal necessary solution,
nothing more.  But then we will complicate your life slightly by asking you to consider sharing out 
what you do with a broad audience. In general a 'formal process' only makes sense if you carefully
think about where you are headed and realize 'cobbling together' is going to end up costing you time.

#### Volume - Complexity - Perfunction Diagram

#### Design Patterns

#### Technologies

#### Time Cash Patience Guide

### Deliverable 2: Landscape

Please let us begin with CINERGI

### Deliverable 3: Roadmap



### Deliverable 4: Curriculum

"Design a course that produces an LWDS-capable scientist"

##### Skills

- Write DFT: Equation to Code to Testing

- Create a Cloud service that produces a synthetic data stream from at least three disparate data sources.

- Translate a physical problem built on differential equations into artifact-free code
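
As a sketch of what the first skill ("Equation to Code to Testing") might look like in miniature, start from the definition of the discrete Fourier transform of $N$ samples $x_0, \dots, x_{N-1}$:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1$$

The code below transcribes that sum directly (an $O(N^2)$ loop, not a fast Fourier transform) and then tests it against cases whose answers are known analytically. The function names and test signals are choices made for this illustration, not part of any workshop material.

```python
import cmath


def dft(samples):
    """Direct O(N^2) discrete Fourier transform of a sequence of numbers."""
    n_samples = len(samples)
    return [
        sum(samples[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def inverse_dft(spectrum):
    """Inverse transform: opposite sign in the exponent, scaled by 1/N."""
    n_samples = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_samples)
            for k in range(n_samples)) / n_samples
        for n in range(n_samples)
    ]


def close(a, b, tol=1e-9):
    """True when two sequences have equal length and nearly equal entries."""
    return len(a) == len(b) and all(abs(x - y) < tol for x, y in zip(a, b))


# Testing: compare against cases where the answer is known in advance.
assert close(dft([1, 0, 0, 0]), [1, 1, 1, 1])    # impulse -> flat spectrum
assert close(dft([2, 2, 2, 2]), [8, 0, 0, 0])    # constant -> only the k=0 bin
signal = [0.5, -1.0, 3.0, 2.5, 0.0]
assert close(inverse_dft(dft(signal)), signal)   # round trip recovers the input
```

Testing against analytic cases and a round trip, rather than only against another library, is one way to catch sign and normalization errors, which are the usual artifacts in hand-written transforms.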

##### Learning Skills

- Bootstrap skills in a non-mainstream programming paradigm; e.g. functional programming

- Build a sensor-to-cloud datastream using Maker community resources.


### Deliverable 5: Road-test Workshop


From an email to Chuck Seers, Pieter Dorrestein, Wenwen Li (sent after the workshop):  


I noted that these persons are respectively at OSU and UCSD and Arizona State. 
They work in environmental science informatics. 
They were interested in but unable to attend the LWDS workshop. 

The workshop will produce:


1. Two papers (opinion piece and paper proper)
2. Recommendations for curriculum in support of LWDS-enabled domain scientists
3. Landscape overview: What is out there
4. Roadmap: What is possible (informal report)
5. Roadtest workshop: How do the recommendations we produce in (2) translate into action?


Suppose a domain scientist has a data problem to solve.
We provide guidelines for arriving at a solution, based on some flow-chart-like approach.
Suppose this scientist is working collaboratively and therefore wishes to share data. 
With one other person? With 2? With 7? 
We would like that sharing to scale to 10,000 people as easily as to 2. 

Ilya Zaslavsky at SDSC has invented CINERGI out of Earthcube and we find that 
to be exactly what we would wish for: A directory of CI for geoscience where it 
is easy to register new resources. So we want to make this a big part of our 
Landscape story.  Pieter maybe you know much more about this… but it is a pleasant 
case of “Here is the apparent solution; now how well is it going to work in practice?” 


Hosted IPython Notebooks and SQL Share are two very short-path approaches to getting LWDS done. 


