# Task 1. Converting NetCDF Files to ZARR Format
## Q1. What would be your chunking strategy to make the resulting file more efficient to access?
- If access patterns are unknown or mixed, give chunks roughly the same aspect ratio as the whole array, so that reading a full row or a full column touches about the same number of chunks and access time is balanced across dimensions.
- If reads mostly run along one dimension (for example, a time series at a single location), make chunks long along that dimension and split the other dimensions instead.
- Ideally, the chunk size should be a multiple of the target filesystem's disk block size, so that each chunk read pulls in as few extra disk blocks as possible.
- Larger chunks can be more efficient because there are fewer of them to read, but each chunk must be loaded whole, so size them to fit comfortably in the memory available on the machine.
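
As a rough illustration of these trade-offs, the sketch below counts how many chunks two read patterns touch under two chunk layouts. The array shape `(31, 7001, 10001)` and the layout `(1, 1401, 2001)` are the ones xarray reports for this dataset in the conversion output; the layout `(31, 256, 256)`, the helper names, and the assumption that the dimension order is time followed by two spatial dimensions are illustrative only.

```python
import math


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of one full chunk."""
    return math.prod(chunks) * itemsize


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last one may be partial)."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunks))


def chunks_read(shape, chunks, fixed):
    """Chunks touched when the dimensions listed in `fixed` are pinned to a
    single index and every other dimension is read in full."""
    grid = chunk_grid(shape, chunks)
    return math.prod(1 if i in fixed else g for i, g in enumerate(grid))


# Assumed dimension order: (time, spatial, spatial); values are float32.
shape = (31, 7001, 10001)
itemsize = 4

layouts = {
    "one time step per chunk": (1, 1401, 2001),   # layout shown in the notebook output
    "full time axis per chunk": (31, 256, 256),   # hypothetical alternative
}

for label, chunks in layouts.items():
    size_mib = chunk_nbytes(chunks, itemsize) / 2**20
    full_map = chunks_read(shape, chunks, fixed={0})         # one time step, whole grid
    pixel_series = chunks_read(shape, chunks, fixed={1, 2})  # one pixel, all time steps
    print(f"{label}: {size_mib:.2f} MiB per chunk, "
          f"{full_map} chunks for a full map, "
          f"{pixel_series} chunks for a pixel time series")
```

With the first layout a single-day map needs 25 chunks and a one-pixel time series needs 31; with the second, the map needs 1120 chunks but the time series needs only 1. Which layout is better depends on which of the two reads dominates.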
## Q2. This dataset is updated daily. If you would like to automate the ZARR conversion to update daily, what will be your strategy?
- Schedule a daily incremental job that checks the source catalog for data files newer than the latest time step already in the target store, and appends only those files to the target.
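
The selection step of such a job could look like the sketch below. It assumes the file names keep the pattern seen in the catalog (`A.P1D.<YYYYMMDD>T<HHMMSS>Z.aust.ipar.nc`) and that the timestamp of the last time step in the store is already known; the function names and the `.aust.other.nc` file name are hypothetical. The append itself is not shown: one option is xarray's `to_zarr(..., append_dim='time')` on a dataset built from the selected files.

```python
from datetime import datetime, timezone


def timestamp_from_name(name):
    """Parse the UTC timestamp from a name like
    'A.P1D.20230101T053000Z.aust.ipar.nc' (assumed naming pattern)."""
    stamp = name.split(".")[2]
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def select_new_files(catalog_names, last_written, suffix=".aust.ipar.nc"):
    """Return the catalog files that are newer than the last time step
    already in the store, sorted oldest first so they append in order."""
    candidates = [n for n in catalog_names if n.endswith(suffix)]
    new = [n for n in candidates if timestamp_from_name(n) > last_written]
    return sorted(new, key=timestamp_from_name)


# Example run: the store already holds data up to 2 January 2023.
catalog_names = [
    "A.P1D.20230104T053000Z.aust.ipar.nc",
    "A.P1D.20230101T053000Z.aust.ipar.nc",
    "A.P1D.20230103T053000Z.aust.ipar.nc",
    "A.P1D.20230102T053000Z.aust.ipar.nc",
    "A.P1D.20230103T053000Z.aust.other.nc",  # hypothetical product, filtered out
]
last_written = datetime(2023, 1, 2, 5, 30, tzinfo=timezone.utc)

to_append = select_new_files(catalog_names, last_written)
print(to_append)
```

Selecting by timestamp rather than by "today's date" means a run that was missed or failed is caught up automatically on the next run.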

In [None]:
## Install Python dependencies

In [19]:
%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Convert January 2023 NetCDF files to ZARR

In [5]:
from siphon.catalog import TDSCatalog
import xarray as xr
import fsspec

def filter_dict_by_extension(input_dict, extensions):
    return {key: input_dict[key] for key in input_dict if key.endswith(extensions)}

# List the catalog references in P1D
cat_p1d = TDSCatalog('https://thredds.aodn.org.au/thredds/catalog/IMOS/SRS/OC/gridded/aqua/P1D/catalog.xml')
print(list(cat_p1d.catalog_refs))

# Get the '2023' catalog reference and print it out
catref_2023 = cat_p1d.catalog_refs['2023']
print(f"name={catref_2023.name}, href={catref_2023.href}, title={catref_2023.title}")

# Follow the '2023' catalog which returns a new TDSCatalog
cat_2023 = catref_2023.follow()
print(list(cat_2023.catalog_refs))

# Get the '01' catalog reference and print it out
catref_2023_01 = cat_2023.catalog_refs['01']
print(f"name={catref_2023_01.name}, href={catref_2023_01.href}, title={catref_2023_01.title}")

# Follow the '01' catalog which returns a new TDSCatalog
cat_2023_01 = catref_2023_01.follow()

# List the catalog refs, which is empty as there are none in month catalog
print(list(cat_2023_01.catalog_refs))

# List the services
for compound_service in cat_2023_01.services:
    print(f"name={compound_service.name}, service_type={compound_service.service_type}")
    for simple_service in compound_service.services:
        print(f"\tname={simple_service.name}, service_type={simple_service.service_type}, access_urls={simple_service.access_urls}")

# Filter the dataset for specific extension: .aust.ipar.nc
filtered_dataset = filter_dict_by_extension(cat_2023_01.datasets, '.aust.ipar.nc')
print(filtered_dataset)

# Open each 2023-01 NetCDF data file over HTTP as a file-like object
nc_files_list = [fsspec.open(ds.access_urls['HTTPServer']).open() for ds in filtered_dataset.values()]
print(nc_files_list)

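# Open all files lazily as a single dataset, with one time step per dask chunk
ds = xr.open_mfdataset(nc_files_list, engine='h5netcdf', parallel=True, chunks={'time': 1})
display(ds)

# Write the combined dataset to a local Zarr store, overwriting any existing store
ds.to_zarr('data.zarr', mode='w')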

['2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024']
name=, href=https://thredds.aodn.org.au/thredds/catalog/IMOS/SRS/OC/gridded/aqua/P1D/2023/catalog.xml, title=2023
['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
name=, href=https://thredds.aodn.org.au/thredds/catalog/IMOS/SRS/OC/gridded/aqua/P1D/2023/01/catalog.xml, title=01
[]
name=regGriddedServices, service_type=Compound
	name=dapService, service_type=OPENDAP, access_urls={}
	name=httpService, service_type=HTTPServer, access_urls={}
	name=wmsService, service_type=WMS, access_urls={}
{'A.P1D.20230101T053000Z.aust.ipar.nc': A.P1D.20230101T053000Z.aust.ipar.nc, 'A.P1D.20230102T053000Z.aust.ipar.nc': A.P1D.20230102T053000Z.aust.ipar.nc, 'A.P1D.20230103T053000Z.aust.ipar.nc': A.P1D.20230103T053000Z.aust.ipar.nc, 'A.P1D.20230104T053000Z.aust.ipar.nc': A.P1D.20230104T053000Z.aust.ip

| | Array | Chunk |
|---|---|---|
| Bytes | 8.09 GiB | 10.69 MiB |
| Shape | (31, 7001, 10001) | (1, 1401, 2001) |
| Dask graph | 775 chunks in 63 graph layers | 775 chunks in 63 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |



<xarray.backends.zarr.ZarrStore at 0x7fb528104ac0>