# Reading and Writing Data on Cloud Object Storage (AWS S3)
Reading from and writing to S3 object storage differs from working with a regular filesystem. Here we read from public-read buckets and requester-pays buckets, and write to a private bucket for Pangeo users. We rely heavily on `fsspec`, which provides filesystem interfaces to S3 (as well as HTTPS, FTP, and many others) in Python.

In [None]:
import fsspec
import pandas as pd

#### Explore items on a public-read S3 bucket

In [None]:
fs = fsspec.filesystem('s3', anon=True)
fs.ls('anaconda-public-datasets')

You can also use `glob` to find items that match a pattern.

In [None]:
fs.glob('anaconda-public-datasets/*/*.csv')

#### Read a CSV file from a public read bucket

In [None]:
infile = fsspec.open("s3://anaconda-public-datasets/iris/iris.csv", 
                     mode='rt', anon=True)
with infile as f:
    df = pd.read_csv(f)
df

#### Writing data to S3 buckets

To write data, you must first set up your AWS credentials. Open a terminal and run `aws configure --profile esip-qhub`, which will prompt for your AWS credentials (the `aws_access_key_id` and `aws_secret_access_key` provided to you). These will be stored in `/home/jovyan/.aws/credentials` under the profile name `esip-qhub`. Make sure you don't commit or share that file anywhere!
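As a quick sanity check after running `aws configure`, the credentials file can be inspected without printing any secrets. The sketch below assumes the standard INI layout of the AWS credentials file (one section per profile); the helper name `profile_is_configured` is hypothetical and not part of any library.

```python
import configparser
from pathlib import Path


def profile_is_configured(credentials_path, profile):
    """Return True if `profile` has both key fields in an AWS credentials file.

    Hypothetical helper: it only checks that the fields exist and are
    non-empty, and never prints or returns the secret values.
    """
    # interpolation=None so that a '%' in a secret is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_path)  # a missing file is silently skipped
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(section.get(key) for key in required)


# Path and profile name as used in this notebook
print(profile_is_configured(Path("/home/jovyan/.aws/credentials"), "esip-qhub"))
```

If this prints `False`, rerun `aws configure --profile esip-qhub` before trying the write examples.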

Once the credentials are in place, you should be able to write to any bucket for which your credentials grant permission. On the ESIP qhub, this is the `s3://esip-qhub` bucket.

Write a CSV file

In [None]:
outfile = fsspec.open("s3://esip-qhub/usgs/testing/iris.csv",
                      mode='wt', profile='esip-qhub')

with outfile as f:
    df.to_csv(f)

List items in a bucket

In [None]:
fs = fsspec.filesystem('s3', anon=False, profile='esip-qhub')
fs.ls('esip-qhub/usgs/testing/')
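The same write, list, and read pattern can be tried without any credentials by swapping the `s3` protocol for fsspec's built-in in-memory filesystem. This is only a sketch for experimenting offline: the `memory://testing/iris.csv` path and the two-line CSV content are made up for illustration, and nothing is sent to S3.

```python
import fsspec

# Write a small text file to the in-memory filesystem (no network, no credentials)
with fsspec.open("memory://testing/iris.csv", mode="wt") as f:
    f.write("sepal_length,species\n5.1,setosa\n")

# List the "directory" just as we listed the bucket prefix above
fs = fsspec.filesystem("memory")
listing = fs.ls("testing", detail=False)
print(listing)

# Read the file back through the same interface
with fsspec.open("memory://testing/iris.csv", mode="rt") as f:
    text = f.read()
print(text)
```

Because only the URL prefix changes, code developed against `memory://` should carry over to `s3://` once the storage options (such as `profile`) are added.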

#### The rest of the examples will use xarray, which follows the NetCDF data model

In [None]:
import xarray as xr

#### Open a NetCDF file from an HTTP server

In [None]:
# The 'simplecache::' prefix downloads the whole file to a local temporary cache before opening it
infile = fsspec.open('simplecache::https://geoport.usgs.esipfed.org/erddap/files/8544pcs-cal_z3/8544pcs-cal_z3.nc')

In [None]:
ds = xr.open_dataset(infile.open(), engine='h5netcdf')

In [None]:
ds

In [None]:
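# Write the dataset to S3 as NetCDF. With 'simplecache::', the file is written
# to a local temporary file first and uploaded when it is closed.
outfile = fsspec.open('simplecache::s3://esip-qhub/usgs/testing/8544pcs-cal_z3.nc',
                      mode='wb', s3=dict(profile='esip-qhub'))
with outfile as f:
    ds.to_netcdf(f)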

#### Read NetCDF data from a bucket

In [None]:
ncfile = fsspec.open('s3://esip-qhub/usgs/testing/8544pcs-cal_z3.nc')
ds = xr.open_dataset(ncfile.open())
ds

#### Read NetCDF data from a THREDDS OPeNDAP service

In [None]:
ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')
ds

Visualization interlude: plot a time range of data with hvplot

In [None]:
import hvplot.xarray

ds['T_20'].sel(time=slice('1982-10-01','1982-10-31')).hvplot(grid=True)

#### Read NetCDF data from ERDDAP's tabledap Service

In [None]:
ds = xr.open_dataset('http://erddap.sensors.ioos.us/erddap/tabledap/gov_usgs_cmgp_buzz_bay_265')
ds

#### Read NetCDF data from ERDDAP's griddap Service

In [None]:
url = 'https://geoport.usgs.esipfed.org/erddap/griddap/adcp_grid_5d6e_e2f9_148d'

ds = xr.open_dataset(url)

In [None]:
ds

In [None]:
ds['CS_300']

In [None]:
ds['CS_300'].sel(time=slice('2009-10-01T12:00:00','2009-10-14T12:00:00')).isel(altitude=[0,1,-1]).hvplot(x='time', grid=True)

List items on a Requester Pays bucket

In [None]:
fs = fsspec.filesystem('s3', anon=False, requester_pays=True)
fs.ls('esip-qhub/noaa/nwm/')

#### Read Zarr dataset from a Requester Pays S3 bucket

In Zarr datasets, each chunk is stored as a separate object, so instead of opening a single file, we use a mapper, which presents the bucket prefix as a dictionary-like key-value store.

In [None]:
ds = xr.open_zarr(fsspec.get_mapper('s3://esip-qhub/noaa/nwm',
                                    anon=False, requester_pays=True),
                  consolidated=True)
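To make the "each chunk is a separate object" point concrete, here is a small standard-library sketch of the object keys a chunked array would occupy. It assumes the default Zarr version 2 layout, where chunk indices are joined with `.`; the array name `streamflow`, the shape, and the chunk sizes are hypothetical and not taken from the dataset above.

```python
import itertools
import math


def chunk_keys(array_name, shape, chunks):
    """List the chunk object keys for one array (assumes Zarr v2 '.'-separated keys)."""
    # Number of chunks along each dimension (the last chunk may be partial)
    counts = [math.ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    index_ranges = [range(n) for n in counts]
    return [
        f"{array_name}/" + ".".join(str(i) for i in index)
        for index in itertools.product(*index_ranges)
    ]


# A hypothetical 4 x 6 array split into 2 x 3 chunks gives 2 x 2 = 4 objects
keys = chunk_keys("streamflow", shape=(4, 6), chunks=(2, 3))
print(keys)
```

Each of these keys corresponds to one object under the bucket prefix, alongside small metadata objects, which is why a key-value mapper is a more natural interface here than a single file handle.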

#### Write Zarr dataset to S3 using LocalCluster

In [None]:
from dask.distributed import Client
client = Client()
client

In [None]:
s3store = fsspec.get_mapper('s3://esip-qhub/usgs/testing/zarr_test',
                            anon=False, profile='esip-qhub')

Write the first time step of the Zarr dataset to the bucket

In [None]:
%%time
ds.isel(time=0).load().to_zarr(store=s3store, mode='w', consolidated=True)   #fails without .load()

In [None]:
client.close()   # close the LocalCluster client 

#### Write Zarr dataset to S3 using a remote cluster
Here we spin up a Qhub Dask Gateway cluster.

In [None]:
from dask_gateway import Gateway
from dask.distributed import Client
gateway = Gateway()
# To see the options that new_cluster accepts, run: gateway.cluster_options()
cluster = gateway.new_cluster(environment='pangeo', profile='Pangeo Worker')
cluster.scale(4)
client = Client(cluster)
cluster

In [None]:
%%time
ds.isel(time=0).load().to_zarr(store=s3store, mode='w', consolidated=True) #fails without .load()

In [None]:
client.close()   # disconnect the client from the cluster
cluster.close()  # shut down the Dask Gateway cluster