# Compressed N-Dim Files with Zarr 

#### This notebook creates a Zarr store, group, and datasets <br/> For more info, see the links below
<br/> Zarr module tutorial<br/>
https://zarr.readthedocs.io/en/stable/tutorial.html <br/>
Talk on Zarr by Alistair Miles <br/>
https://youtu.be/qyJXBlrdzBs

In [None]:
import zarr
import numpy as np
import pandas as pd
import sys
import netCDF4
from datetime import datetime,timedelta
import xarray as xr
from numcodecs import Blosc
import dask  # used by xarray's open_mfdataset

## Create a Zarr Group

Set the path where the Zarr store will live.
<br/>For now we use basic directory storage on Horel-group4.



In [None]:
dataloc = '/uufs/chpc.utah.edu/common/home/horel-group4/tmccorkle/imerg_early/'
store = zarr.DirectoryStore(dataloc+'imerg.zarr')

In [None]:
root = zarr.group(store=store, overwrite=True)
root

Within the root group, create a group for each variable

We care mostly about the precipitationCal estimates, so that's what our example will show.

In [None]:
pcal = root.create_group('precipitationCal')

Create a precipitationCal dataset for each year 

Chunk shape follows from how we will read this data. We will mostly want time series at a point, so each chunk spans the full time axis and we chunk only along the lon/lat dimensions. <br/> Dims: (lon,lat,time)
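
To make the chunking choice concrete, here is a small sketch of the arithmetic behind it, using the shape, chunk shape, and dtype from the datasets in this notebook. The grid point used at the end is a hypothetical example, not a location from the data.

```python
import math

shape = (3600, 1800, 92)    # (lon, lat, time)
chunks = (400, 200, 92)     # every chunk holds the full time axis
itemsize = 4                # bytes per value for dtype 'i4'

# Number of chunks along each dimension, and in total
chunks_per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of one chunk in megabytes
chunk_mb = math.prod(chunks) * itemsize / 1e6

# Chunk grid index holding a hypothetical point (lon index 1234, lat index 567)
point = (1234, 567)
chunk_index = (point[0] // chunks[0], point[1] // chunks[1], 0)

print(chunks_per_dim, n_chunks, round(chunk_mb, 2), chunk_index)
```

Because the time axis is never split, a full time series at any single grid point lives in exactly one chunk.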

In [None]:
z0 = pcal.create_dataset('2014',shape=(3600,1800,92),chunks=(400,200,92),dtype='i4')
z1 = pcal.create_dataset('2015',shape=(3600,1800,92),chunks=(400,200,92),dtype='i4')
z2 = pcal.create_dataset('2016',shape=(3600,1800,92),chunks=(400,200,92),dtype='i4')
z3 = pcal.create_dataset('2017',shape=(3600,1800,92),chunks=(400,200,92),dtype='i4')
z4 = pcal.create_dataset('2018',shape=(3600,1800,92),chunks=(400,200,92),dtype='i4')
z5 = pcal.create_dataset('2019',shape=(3600,1800,92),chunks=(400,200,92),dtype='i4')

#### We can look at our Zarr file hierarchy using root.tree()

In [46]:
%%html
<img src="http://home.chpc.utah.edu/~u1014509/root.png" width="60" height="60">

---
## Load in all IMERG-E daily files for a year

Note: xarray's open_mfdataset depends on Dask.

It opens multiple files at once and combines them into a single dataset.

In [None]:
loc = '/uufs/chpc.utah.edu/common/home/horel-group4/tmccorkle/imerg_early/2014v6/'

ds = xr.open_mfdataset(loc+'*.nc4',combine='by_coords')

Get size and info for the dataset

In [None]:
print('ds size in GB {:0.2f}\n'.format(ds.nbytes / 1e9))
ds.info()

#### Here we look at the Dask arrays and load the entire precipitationCal variable

In [None]:
for name, da in ds.data_vars.items():
    print(name, da.data)
    
precip = ds['precipitationCal'].values  # loads the full variable into memory

# Fill NaN values with an arbitrary negative number
precip = np.nan_to_num(precip, nan=-1.0)

In [None]:
# Reorder dimensions from (time, lon, lat) to (lon, lat, time)
zarray = np.moveaxis(precip, 0, -1)

zarray = np.round(zarray, decimals=1)
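
The reshuffle above moves the time axis from first to last position. As a minimal sketch on a small stand-in array (the sizes here are hypothetical), an explicit per-time-step loop and `np.moveaxis` give the same result:

```python
import numpy as np

# Small stand-in for the (time, lon, lat) array; the sizes are hypothetical
n_time, n_lon, n_lat = 3, 4, 2
small = np.arange(n_time * n_lon * n_lat, dtype='f4').reshape(n_time, n_lon, n_lat)

# Loop version: copy each time slice into the last axis
looped = np.zeros((n_lon, n_lat, n_time))
for i in range(n_time):
    looped[:, :, i] = small[i, :, :]

# Vectorized version: move axis 0 (time) to the end
moved = np.moveaxis(small, 0, -1)

print(moved.shape)
print(np.array_equal(looped, moved))
```

`np.moveaxis` returns a view rather than a copy, so it avoids allocating a second full-size array until a later step (such as rounding) creates one.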

#### Place the data inside the Zarr dataset we made, keeping the same chunk shape. Here, we use the Blosc compressor (zstd, level 3) with its bit-shuffle filter.

##### Requires the numcodecs module for compression capability

In [None]:
compressor = Blosc(cname='zstd',clevel=3,shuffle=Blosc.BITSHUFFLE)

In [None]:
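# A Zarr array's compressor is fixed at creation, so recreate the 2014 dataset with it.
# dtype='i4' matches the datasets above; the float values are cast to integers on write.
z0 = root['precipitationCal'].create_dataset('2014', data=zarray, chunks=(400,200,92), dtype='i4', compressor=compressor, overwrite=True)
z0.info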

#### Dataset info after compression

In [47]:
%%html
<img src="http://home.chpc.utah.edu/~u1014509/datinfo.png" width="60" height="60">