
Problems faced while storing onto Zarr store using ABSStore #528

Open

dokooh opened this issue Dec 12, 2019 · 25 comments


@dokooh

dokooh commented Dec 12, 2019

import xarray as xr
import zarr
from azure.storage.blob import BlockBlobService

# ds is a large xarray Dataset opened earlier (via xr.open_mfdataset; see below)
store = zarr.ABSStore(container='zarrstoreall', prefix='zarrstoreall',
                      account_name='xxxx', account_key='xxxx',
                      blob_service_kwargs={'is_emulated': False})

compressor = zarr.Blosc(cname='zstd', clevel=3)
encoding = {vname: {'compressor': compressor} for vname in ds.data_vars}
ds.to_zarr(store=store, encoding=encoding, consolidated=True)

Problem description

I'm trying to use ABSStore to store a large xarray dataset into a Zarr store backed by Azure blob storage (see the code in the previous section). I am currently facing two issues:

  1. I first get some sort of network error when loading "certain" variables into the store:

     [screenshot: error traceback]

  2. After some time passes, I get this error:

     [screenshot: second error traceback]

Needless to say, with relatively smaller xarray datasets I did not face these issues.

I appreciate your kind attention.

Version and installation information

Please provide the following:

  • Value of zarr.__version__ = '2.3.2'
  • Value of numcodecs.__version__ = '0.6.4'
  • Version of Python interpreter = Python 3.7.3
  • Operating system (Linux/Windows/Mac) = Databricks Runtime Version
    6.1 (includes Apache Spark 2.4.4, Scala 2.11)
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda")
    !pip install zarr

Also, if you think it might be relevant, please provide the output from pip freeze or
conda env export depending on which was used to install Zarr.
pip freeze output:
adal==1.2.2
asciitree==0.3.3
asn1crypto==0.24.0
azure==4.0.0
azure-applicationinsights==0.1.0
azure-batch==4.1.3
azure-common==1.1.23
azure-cosmosdb-nspkg==2.0.2
azure-cosmosdb-table==1.0.6
azure-datalake-store==0.0.48
azure-eventgrid==1.3.0
azure-graphrbac==0.40.0
azure-keyvault==1.1.0
azure-loganalytics==0.1.0
azure-mgmt==4.0.0
azure-mgmt-advisor==1.0.1
azure-mgmt-applicationinsights==0.1.1
azure-mgmt-authorization==0.50.0
azure-mgmt-batch==5.0.1
azure-mgmt-batchai==2.0.0
azure-mgmt-billing==0.2.0
azure-mgmt-cdn==3.1.0
azure-mgmt-cognitiveservices==3.0.0
azure-mgmt-commerce==1.0.1
azure-mgmt-compute==4.6.2
azure-mgmt-consumption==2.0.0
azure-mgmt-containerinstance==1.5.0
azure-mgmt-containerregistry==2.8.0
azure-mgmt-containerservice==4.4.0
azure-mgmt-cosmosdb==0.4.1
azure-mgmt-datafactory==0.6.0
azure-mgmt-datalake-analytics==0.6.0
azure-mgmt-datalake-nspkg==3.0.1
azure-mgmt-datalake-store==0.5.0
azure-mgmt-datamigration==1.0.0
azure-mgmt-devspaces==0.1.0
azure-mgmt-devtestlabs==2.2.0
azure-mgmt-dns==2.1.0
azure-mgmt-eventgrid==1.0.0
azure-mgmt-eventhub==2.6.0
azure-mgmt-hanaonazure==0.1.1
azure-mgmt-iotcentral==0.1.0
azure-mgmt-iothub==0.5.0
azure-mgmt-iothubprovisioningservices==0.2.0
azure-mgmt-keyvault==1.1.0
azure-mgmt-loganalytics==0.2.0
azure-mgmt-logic==3.0.0
azure-mgmt-machinelearningcompute==0.4.1
azure-mgmt-managementgroups==0.1.0
azure-mgmt-managementpartner==0.1.1
azure-mgmt-maps==0.1.0
azure-mgmt-marketplaceordering==0.1.0
azure-mgmt-media==1.0.0
azure-mgmt-monitor==0.5.2
azure-mgmt-msi==0.2.0
azure-mgmt-network==2.7.0
azure-mgmt-notificationhubs==2.1.0
azure-mgmt-nspkg==3.0.2
azure-mgmt-policyinsights==0.1.0
azure-mgmt-powerbiembedded==2.0.0
azure-mgmt-rdbms==1.9.0
azure-mgmt-recoveryservices==0.3.0
azure-mgmt-recoveryservicesbackup==0.3.0
azure-mgmt-redis==5.0.0
azure-mgmt-relay==0.1.0
azure-mgmt-reservations==0.2.1
azure-mgmt-resource==2.2.0
azure-mgmt-scheduler==2.0.0
azure-mgmt-search==2.1.0
azure-mgmt-servicebus==0.5.3
azure-mgmt-servicefabric==0.2.0
azure-mgmt-signalr==0.1.1
azure-mgmt-sql==0.9.1
azure-mgmt-storage==2.0.0
azure-mgmt-subscription==0.2.0
azure-mgmt-trafficmanager==0.50.0
azure-mgmt-web==0.35.0
azure-nspkg==3.0.2
azure-servicebus==0.21.1
azure-servicefabric==6.3.0.0
azure-servicemanagement-legacy==0.20.6
azure-storage-blob==1.5.0
azure-storage-common==1.4.2
azure-storage-file==1.4.0
azure-storage-queue==1.4.0
backcall==0.1.0
boto==2.49.0
boto3==1.9.162
botocore==1.12.163
certifi==2019.3.9
cffi==1.12.2
cftime==1.0.4.2
chardet==3.0.4
cryptography==2.6.1
cycler==0.10.0
Cython==0.29.6
dask==2.9.0
decorator==4.4.0
docutils==0.14
fasteners==0.15
fsspec==0.6.1
idna==2.8
ipython==7.4.0
ipython-genutils==0.2.0
isodate==0.6.0
jedi==0.13.3
jmespath==0.9.4
kiwisolver==1.1.0
koalas==0.23.0
locket==0.2.0
matplotlib==3.0.3
monotonic==1.5
msrest==0.6.10
msrestazure==0.6.2
netCDF4==1.5.3
numcodecs==0.6.4
numpy==1.16.2
oauthlib==3.1.0
pandas==0.24.2
parso==0.3.4
partd==1.1.0
patsy==0.5.1
pexpect==4.6.0
pickleshare==0.7.5
prompt-toolkit==2.0.9
psycopg2==2.7.6.1
ptyprocess==0.6.0
pyarrow==0.13.0
pycparser==2.19
pycurl==7.43.0
Pygments==2.3.1
pygobject==3.20.0
PyJWT==1.7.1
pyOpenSSL==19.0.0
pyparsing==2.4.2
PySocks==1.6.8
python-apt==1.1.0b1+ubuntu0.16.4.5
python-dateutil==2.8.0
pytz==2018.9
requests==2.21.0
requests-oauthlib==1.3.0
s3transfer==0.2.1
scikit-learn==0.20.3
scipy==1.2.1
seaborn==0.9.0
six==1.12.0
ssh-import-id==5.5
statsmodels==0.9.0
toolz==0.10.0
traitlets==4.3.2
unattended-upgrades==0.1
urllib3==1.24.1
virtualenv==16.4.1
wcwidth==0.1.7
xarray==0.14.1
zarr==2.3.2

@alimanfoo
Member

alimanfoo commented Dec 12, 2019 via email

@tjcrone
Member

tjcrone commented Dec 12, 2019

I believe the first error is actually a warning, and occurs when Zarr looks for metadata files that do not exist. This has been fixed in newer versions of the Azure SDK; I would try upgrading azure-storage-blob to v2.1.

It's worth noting that while investigating this I learned that there is a major new release of the Azure SDK that looks like it will break ABSStore entirely. We will probably need to deal with this soon, and it's not obvious how to support two essentially incompatible versions of the SDK. I will probably open a new issue to work on this eventually.

@dokooh
Author

dokooh commented Dec 12, 2019

Thanks @alimanfoo. I use the open_mfdataset method, and I get some warnings during the import which may suggest that:

/local_disk0/tmp/1576146109393-0/PythonShell.py:4: FutureWarning: In xarray version 0.15 the default behaviour of open_mfdataset
will change. To retain the existing behavior, pass
combine='nested'. To use future default behavior, pass
combine='by_coords'. See
http://xarray.pydata.org/en/stable/combining.html#combining-multi

import errno
/databricks/python/lib/python3.7/site-packages/xarray/backends/api.py:933: FutureWarning: The datasets supplied have global dimension coordinates. You may want
to use the new combine_by_coords function (or the
combine='by_coords' option to open_mfdataset) to order the datasets
before concatenation. Alternatively, to continue concatenating based
on the order the datasets are supplied in future, please use the new
combine_nested function (or the combine='nested' option to
open_mfdataset). from_openmfds=True,

Would this suggest that some of the files were not loaded into xarray? I will try experimenting with the combine options to check this.

@jakirkham
Member

Maybe not related, but did you see PR ( #526 )?

@shikharsg
Member

shikharsg commented Dec 12, 2019

I believe the first error is actually a warning, and occurs when Zarr looks for metadata files that do not exist. This has been fixed in newer versions of the Azure SDK; I would try upgrading azure-storage-blob to v2.1.

It's worth noting that while investigating this I learned that there is a major new release of the Azure SDK that looks like it will break ABSStore entirely. We will probably need to deal with this soon, and it's not obvious how to support two essentially incompatible versions of the SDK. I will probably open a new issue to work on this eventually.

Agree with @tjcrone here; the first warning goes away after updating to the newer version.

As for the above error, I have faced various errors while transferring large amounts of netCDF data to zarr: mostly out-of-memory errors (so it's worth monitoring the memory of your device/VM while doing this), but also the one above. My solution was to transfer the data to zarr "in parts". This is now straightforward with xarray's new "append" feature for zarr: you can call ds.to_zarr with mode='a' and provide the dimension along which the data will be appended. See here: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html or here: pydata/xarray#2706

@dokooh
Author

dokooh commented Jan 9, 2020

Hi,
we did some investigation and realized that we have some NaN values in the dataset which we were not aware of. 300 rows in total in a full year chunk. Could all this be caused by NaN/Inf values?

@alimanfoo
Member

Hi,
we did some investigation and realized that we have some NaN values in the dataset which we were not aware of. 300 rows in total in a full year chunk. Could all this be caused by NaN/Inf values?

I would not have thought so. At least on the side of writing the zarr data, zarr should be agnostic about what the actual data values are; it will just write them.

But it's still unclear to me at least whether the errors are being generated during the read from netcdf or the write to zarr. The error messages suggest it's the read from netcdf that's triggering the error, but I may have misunderstood. Are you reading the netcdf data from ABS, or is that being read from a local file system? Apologies if I'm barking up the wrong tree.

@dokooh
Author

dokooh commented Jan 10, 2020

Hi,
we did some investigation and realized that we have some NaN values in the dataset which we were not aware of. 300 rows in total in a full year chunk. Could all this be caused by NaN/Inf values?

I would not have thought so. At least on the side of writing the zarr data, zarr should be agnostic about what the actual data values are; it will just write them.

But it's still unclear to me at least whether the errors are being generated during the read from netcdf or the write to zarr. The error messages suggest it's the read from netcdf that's triggering the error, but I may have misunderstood. Are you reading the netcdf data from ABS, or is that being read from a local file system? Apologies if I'm barking up the wrong tree.

Thanks for your kind follow-up. We are reading the netCDF files from the local file system through xarray and then writing to Zarr.

@shikharsg
Member

Re: getting NaN values

I think I might have found out why this happens, as I ran into this myself.

There is a fill_value attribute in zarr, which zarr uses to fill out missing chunks (see here).

Xarray uses this same attribute as the _FillValue attribute (see here) for decoding using the CF conventions, which is something quite different from filling out missing chunks.

@zarr-developers/core-devs Is this a correct interpretation? If so, where should this be fixed: in xarray or in zarr?

@dokooh I fixed this temporarily by passing mask_and_scale=False to xr.open_zarr

@alimanfoo
Member

Hi @shikharsg, thanks a lot for following up.

Yes, zarr has a fill_value attribute in the array metadata, and this is used to fill out missing chunks.

I don't know the details of how the xarray zarr backend uses the _FillValue attribute; I defer to @rabernat and @jhamman.

I'm still not sure what the underlying problem is here. @shikharsg do you have a handle on where the problem is? Could you elaborate?

@martindurant
Member

If you are curious, would you be willing to trial abfs via fsspec, as enabled by #546 (implemented only for the nested store for now, so it may not work for you)? You would need https://github.com/dask/adlfs
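For reference, a hedged sketch of what the fsspec route looks like. The abfs URL and credentials below are placeholders (adlfs would be required for abfs, which is an untested assumption here); the built-in memory:// protocol is used just to show the mapper mechanics:

```python
# fsspec.get_mapper turns a URL into a MutableMapping usable as a zarr store.
# With adlfs installed, something like
#     fsspec.get_mapper("abfs://container/prefix",
#                       account_name="...", account_key="...")
# should work the same way (untested assumption); "memory://" below just
# demonstrates the mapper interface.
import fsspec

m = fsspec.get_mapper("memory://zarr-demo")
m["greeting"] = b"hello"
print(m["greeting"])  # b'hello'
```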

@shikharsg
Member

shikharsg commented Mar 26, 2020

So I had a large number of netCDF files which I transferred to zarr back in October 2018, before xr.to_zarr's append feature existed. So when it was finally released sometime mid last year, I had to manually build up the _ARRAY_DIMENSIONS attribute, as I have done in the small reproducible example below.

Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zarr
>>> import xarray as xr
>>> import numpy as np
>>> zarr.__version__, xr.__version__, np.__version__
('2.4.0', '0.15.0', '1.18.1')
>>>
>>> # in memory zarr array
>>> store = zarr.MemoryStore()
>>> grp = zarr.open_group(store)
>>> arr = zarr.open_array(store, path='foo', shape=(2, 10), fill_value=0.0, chunks=(1, 10))
>>> arr[0] = np.zeros((10,))
>>> arr[0] = np.ones((10,))
>>> 
>>> arr[:]
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>>
>>> # manually build up dimensions
>>> dim1 = zarr.open_array(store, path='dim1', shape=(2,))
>>> dim1[:] = np.array(list(range(1, 3)))
>>> dim1.attrs['_ARRAY_DIMENSIONS'] = ['dim1']
>>> dim2 = zarr.open_array(store, path='dim2', shape=(10,))
>>> dim2[:] = np.array(list(range(1, 11)))
>>> dim2.attrs['_ARRAY_DIMENSIONS'] = ['dim2']
>>> arr.attrs['_ARRAY_DIMENSIONS'] = ['dim1', 'dim2']
>>> 
>>> xr.open_zarr(store)['foo'].values
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
>>> 
>>> arr[:]
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

As you can see, zarr and xarray return different results. This is because xarray uses the fill_value attribute to replace all 0 values with NaN.

@shikharsg
Member

Perhaps this is a more appropriate example, where you can see that zarr and xarray use fill_value in different ways:

>>> zarr.__version__, xr.__version__, np.__version__
('2.4.0', '0.15.0', '1.18.1')
>>> 
>>> # in memory zarr array
>>> store = zarr.MemoryStore()
>>> grp = zarr.open_group(store)
>>> arr = zarr.open_array(store, path='foo', shape=(2, 10), fill_value=0.0, chunks=(1, 10))
>>> arr[0] = np.ones((10,))
>>> 
>>> arr[:]
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>> 
>>> # manually build up dimensions
>>> dim1 = zarr.open_array(store, path='dim1', shape=(2,))
>>> dim1[:] = np.array(list(range(1, 3)))
>>> dim1.attrs['_ARRAY_DIMENSIONS'] = ['dim1']
>>> dim2 = zarr.open_array(store, path='dim2', shape=(10,))
>>> dim2[:] = np.array(list(range(1, 11)))
>>> dim2.attrs['_ARRAY_DIMENSIONS'] = ['dim2']
>>> arr.attrs['_ARRAY_DIMENSIONS'] = ['dim1', 'dim2']
>>> 
>>> print(xr.open_zarr(store)['foo'].values)
[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [nan nan nan nan nan nan nan nan nan nan]]
>>> 

@dazzag24

If you are curious, would you be willing to trial abfs via fsspec, as allowed by #546 (implements only for nested store for now, so may not work for you) ? You would need https://github.com/dask/adlfs

@martindurant Is the adlfs work appropriate for files stored in standard Azure Blob Storage? From the description it looks like it targets Data Lake storage?
Thanks

@martindurant
Member

It implements both Data Lake and Blob storage. The latter is more recent, but I believe it is complete.

@shikharsg
Member

@martindurant is this in the context of the current issue, or just in general?

@martindurant
Member

In hindsight, it probably makes no difference to how the NaN value is inferred by zarr versus xarray; so, in general.

@shikharsg
Member

Would love to try it. Will try to check it out over the next couple of days.

@dazzag24

@martindurant Does it support SAS tokens? I see the example mentions only

STORAGE_OPTIONS={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

@martindurant
Member

I have no idea what SAS tokens are :|
You would have to ask in an issue at adlfs. Perhaps this was a red herring!
Alternatively, if zarr's ABS store supports this auth and adlfs does not, it would probably be trivial to port the code.

@shikharsg
Member

(Quoting the earlier comment with the full reproducible example, showing that xarray uses the fill_value attribute to replace all 0 values with NaN.)

@rabernat @jhamman would be great to have your comments on this

@dazzag24

I have no idea what SAS tokens are :|

@martindurant FYI, SAS (shared access signature) tokens are a way of allowing fine-grained and potentially time-limited access to your blob stores:
https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview

For example, you could give someone read-only access for a period of one month via a token.

@martindurant
Member

OK, so some sort of delegation thing...
Still, the conversation about how to use these with the abfs fsspec backend should happen in the adlfs repo (if it doesn't already support them), pointing to the implementation in zarr if that already works. I am just trying to rationalise the number of places that storage things are defined...

@Chroxvi

Chroxvi commented Jul 13, 2020

It's worth noting that while investigating this I learned that there is a major new release of the Azure SDK that looks like it will break ABSStore entirely. We will probably need to deal with this soon, and it's not obvious how to support two essentially incompatible versions of the SDK. I will probably open a new issue to work on this eventually.

@tjcrone Did you ever create a new issue for this? I can't seem to find one. Unfortunately, version 12 of the azure-storage-blob SDK does break ABSStore entirely.

@jakirkham
Member

cc @TomAugspurger (who may have thoughts here 🙂)
