<h1 style="text-align: center;"><img style="display: inline;" src="https://raw.githubusercontent.com/pangeo-data/branding/master/logo/v_small_pangeo_logo.png"> ❤️ OSN</h1>

## Interactive Data-Proximate Computing in the Cloud 


Slides for a talk at the 2021-04-29 [Open Storage Network Seminar](https://www.openstoragenetwork.org/seminar-series/apr-29-2021-data-sharing-and-distributed-storages-role-in-research-next-steps/)

[![Binder](https://mybinder.org/badge_logo.svg)](https://binder.pangeo.io/v2/gh/rabernat/pangeo-osn-demo/34e294b/?urlpath=git-pull?repo=https://github.com/rabernat/pangeo-osn-demo%26amp%3Bbranch=main%26amp%3Burlpath=tree/pangeo-osn-demo/osn_pangeo.ipynb)

In [None]:
# Run this cell first to work around the coordinates error described in
# https://github.com/holoviz/hvplot/issues/603
import warnings
warnings.simplefilter("ignore")
import numpy as np
import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor on xarray objects
from dask.diagnostics import ProgressBar
# Zarr store served over HTTPS from OSN
url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/fma.zarr"
ds = xr.open_zarr(url, consolidated=True)
# Replace x and y with integer index coordinates
ds = ds.assign_coords({'x': ('x', np.arange(len(ds.x))),
                       'y': ('y', np.arange(len(ds.y)))})

Most of what I am going to say is here: 

R. P. Abernathey et al., "Cloud-Native Repositories for Big Scientific Data," in Computing in Science & Engineering, vol. 23, no. 2, pp. 26-35, 1 March-April 2021. **https://doi.org/10.1109/MCSE.2021.3059437**

<img width="60%" src="https://raw.githubusercontent.com/rabernat/pangeo-osn-demo/main/images/ieee_paper_header.png">

<img src="https://raw.githubusercontent.com/pangeo-data/branding/master/logo/v_small_pangeo_logo.png">

## <http://pangeo.io>

<img class="float-right" style="float: right;" width="50%" src="https://raw.githubusercontent.com/rabernat/pangeo-osn-demo/main/images/pangeo_logos.png">

- Grassroots collaboration between scientists and software developers around open-source tools for
  - big data processing
  - visualization
  - machine learning
- Foundational support from NSF EarthCube
- International partners, industry connections

<img style="float: right;" src="https://ndownloader.figshare.com/files/22017009">

# Data Access Modes in Science


- **Download model** is by far the most prevalent:
  _download and organize data on local computers in order to make it ready for computing_
  - Dependency of analysis codes on local filesystem paths is a barrier to collaboration / reproducibility
  - Inefficient / duplicative (same datasets are downloaded and stored repeatedly)
  - Can’t scale to modern data needs
  - Limits inclusion and knowledge transfer
- **Cloud-native model**: bring compute to the data

# Pangeo Infrastructure

<img src="https://raw.githubusercontent.com/rabernat/pangeo-osn-demo/main/images/pangeo_cloud_infrastructure.png">


<img style="float: right; border: 1px solid gray;" width="50%" src="https://raw.githubusercontent.com/rabernat/pangeo-osn-demo/main/images/google_cloud_cmip6_blog_screenshot.png">

# Pangeo / ESGF CMIP6 Public Dataset

- https://cloud.google.com/blog/products/data-analytics/new-climate-model-data-now-google-public-datasets
- https://medium.com/pangeo/cmip6-in-the-cloud-five-ways-96b177abe396
- https://pangeo-data.github.io/pangeo-cmip6-cloud/

1 PB and growing.



# The Pangeo Software Stack

<img src="https://raw.githubusercontent.com/rabernat/pangeo-osn-demo/main/images/pangeo_cloud_stack.png">


# Data Model and Storage: Xarray + Zarr

Example dataset from the [NASA / CNES SWOT Adopt-A-Crossover field campaign](https://www.clivar.org/news/swot-%E2%80%98adopt-crossover%E2%80%99-consortium-has-been-endorsed-clivar)

In [None]:
import xarray as xr
url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/fma.zarr"
xr.open_zarr(url, consolidated=True)
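
To make the object-store layout concrete, here is a minimal sketch of how a Zarr (v2) array maps onto individual objects. Each chunk is stored under a key built from its grid indices joined by dots, so a reader only needs to fetch the chunks that overlap the requested slice. The array name, chunk shape, and the helper `chunk_keys_for_slice` are hypothetical illustrations, not values taken from the dataset above.

```python
from itertools import product


def chunk_keys_for_slice(starts, stops, chunks, array_name):
    """Return the Zarr v2 chunk keys that overlap a half-open index range.

    starts/stops give the selection along each dimension; chunks gives the
    chunk length along each dimension. Keys follow the default v2 layout
    "<array>/<i>.<j>.<k>".
    """
    per_dim = []
    for start, stop, size in zip(starts, stops, chunks):
        first = start // size          # index of the first chunk touched
        last = (stop - 1) // size      # index of the last chunk touched
        per_dim.append(range(first, last + 1))
    return [
        f"{array_name}/" + ".".join(str(i) for i in idx)
        for idx in product(*per_dim)
    ]


# Hypothetical (time, y, x) array with one time step per chunk.
keys = chunk_keys_for_slice(
    starts=(0, 0, 0),
    stops=(2, 500, 500),
    chunks=(1, 1000, 1000),
    array_name="sst",
)
print(keys)
```

Because each key is an independent object, many workers can request different chunks at the same time, which is what makes this layout a good fit for object storage.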

<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />

<h1>Interactive Visualization:
    <img style="display: inline;" width="15%" src="https://holoviews.org/_static/logo_horizontal.png">
</h1>

In [None]:
ds.sosstsst.hvplot.image('x', 'y', clim=(5, 25), rasterize=True, dynamic=True, width=800, height=450,
                         widget_type='scrubber', widget_location='bottom', cmap='magma')

<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />

<h1>Distributed Computing:
    <img style="display: inline;" width="15%" src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg">
</h1>

In [None]:
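# Lazy time mean; nothing is read until .compute() is called
mean_sst = ds.sosstsst.mean(dim="time_counter")
# Render the task graph, colored by execution order
mean_sst.data.visualize(optimize_graph=True, color="order",
                        cmap="viridis", node_attr={"penwidth": "2"})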

In [None]:
from dask.diagnostics import ProgressBar
with ProgressBar():
    mean_sst.compute()
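
The mean above is evaluated chunk by chunk. As a rough illustration of that pattern (not Dask's actual implementation, and a scalar mean rather than the per-pixel time mean in the cell above), the sketch below reduces each chunk to a partial sum and count in a thread pool and then combines the partials. The data and helper names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(chunk):
    """Reduce one chunk to (sum, count) so partial results can be combined."""
    return sum(chunk), len(chunk)


def chunked_mean(chunks, max_workers=4):
    """Mean over all values, computed from per-chunk partial reductions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Toy data standing in for chunks of a much larger array.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(chunked_mean(chunks))
```

Carrying the counts alongside the sums matters: averaging the per-chunk means directly would give the wrong answer whenever chunks differ in size.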

<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />

<img width="50%" style="float: right;" src="https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/5992/9387473/9354557/abern3-3059437-large.gif">

# Distributed Read Performance

- Compared OSN throughput to Google Cloud Storage in _US-CENTRAL1_
- For modest levels of concurrent reads (< 50), **OSN was faster**
- Read throughput of **5 GB/s**
- Nearly 2 orders of magnitude faster than a legacy data portal
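
For a sense of scale, the arithmetic below converts throughput into wall-clock time. The 1 TB dataset size is a hypothetical example, and the legacy portal is taken as exactly 100x slower, which is an upper-bound reading of "nearly 2 orders of magnitude" rather than a measured figure.

```python
def read_time_seconds(size_bytes, throughput_bytes_per_s):
    """Time to stream size_bytes at a sustained throughput."""
    return size_bytes / throughput_bytes_per_s


GB = 10**9
TB = 10**12

dataset_size = 1 * TB                      # hypothetical dataset size
osn_throughput = 5 * GB                    # 5 GB/s, from the benchmark above
legacy_throughput = osn_throughput / 100   # assumption: exactly 100x slower

osn_seconds = read_time_seconds(dataset_size, osn_throughput)
legacy_seconds = read_time_seconds(dataset_size, legacy_throughput)

print(f"OSN:    {osn_seconds:.0f} s")
print(f"Legacy: {legacy_seconds / 3600:.1f} h")
```

Under these assumptions the same read takes a few minutes instead of several hours, which is the difference between interactive and batch-style analysis.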

# Don't Be Stupid: Use the Commercial Cloud 👍

### (Except for data storage)

- Commercial cloud object storage (AWS S3, GCS, etc.) is ~\$250K / PB year
  - Public dataset programs can help, but not all scientifically useful data qualify
  - Scientists have an existential fear of losing control of their data to the cloud companies
- Wasabi is \$50K / PB year
  - ...but comes with heavy egress limits
- OSN is \$30K / PB year (assuming 5-year lifespan) with comparable in-cloud performance
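
The price gap compounds over the life of a dataset. The sketch below multiplies the approximate per-PB-year figures quoted above by a hypothetical scenario of 1 PB held for 5 years; it ignores egress and request fees, and the function name `storage_cost` is introduced here only for illustration.

```python
# Approximate prices quoted above, in USD per PB per year.
COST_PER_PB_YEAR = {
    "commercial cloud": 250_000,
    "Wasabi": 50_000,
    "OSN": 30_000,
}


def storage_cost(provider, petabytes, years):
    """Total storage cost in USD, ignoring egress and request fees."""
    return COST_PER_PB_YEAR[provider] * petabytes * years


# Hypothetical scenario: 1 PB held for 5 years.
for provider in COST_PER_PB_YEAR:
    print(f"{provider}: ${storage_cost(provider, 1, 5):,}")
```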
  

# Challenges / Questions for OSN

- **Egress** - A hybrid approach will require a high volume of traffic between OSN pods and the commercial cloud. _We must find a way to bypass egress costs._
- **Governance** - Who owns OSN pods? Who decides about storage allocations? 
- **Software Stack** - How can scientists transition from file-based to object-based data access? Maybe the Pangeo stack can be helpful here...