Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TypeError: Cannot cast array data from dtype('O') to dtype('int64') when applying histogram to image time series #9

Closed
rbavery opened this issue Aug 13, 2019 · 4 comments

Comments

@rbavery
Copy link

rbavery commented Aug 13, 2019

I've tried to create a minimal reproducible example (similar to the tutorial, which I'm able to run), but working with a 3D DataArray. But I get an error:

from xhistogram.xarray import histogram 
import numpy as np
import xarray as xr
def tseries_histogram_labels(data_arr, rmin, rmax, nbins):
    bin_arr = np.linspace(rmin, rmax, nbins)
    time_len = data_arr.time.shape[0]
    bin_arrs = list(bin_arr * np.ones((time_len,1)))
    return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
ny, nx = 100, 100
nt = 44
data_arr = xr.DataArray(np.random.randn(nt,ny,nx), dims = ['time', 'y', 'x'], name='blue reflectance')
tseries_histogram_labels(data_arr, -4, 4, 50)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      9 nt = 44
     10 data_arr = xr.DataArray(np.random.randn(nt,ny,nx), dims = ['time', 'y', 'x'], name='blue reflectance')
---> 11 tseries_histogram_labels(data_arr, -4, 4, 50)

 in tseries_histogram_labels(data_arr, rmin, rmax, nbins)
      5     time_len = data_arr.time.shape[0]
      6     bin_arrs = list(bin_arr * np.ones((time_len,1)))
----> 7     return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
      8 ny, nx = 100, 100
      9 nt = 44

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
    127 
    128     h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
--> 129                         block_size=block_size)
    130 
    131     # create output dims

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
    124 
    125     bin_counts = _dispatch_bincount(bin_indices, weights, N, hist_shapes,
--> 126                                     block_size=block_size)
    127 
    128     # just throw out everything outside of the bins, as np.histogram does

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _dispatch_bincount(bin_indices, weights, N, hist_shapes, block_size)
     79     if len(block_chunks) == 1:
     80         # single global chunk, don't need a loop over chunks
---> 81         return _bincount_2d(bin_indices, weights, N, hist_shapes)
     82     else:
     83         return _bincount_loop(bin_indices, weights, N, hist_shapes, block_chunks)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _bincount_2d(bin_indices, weights, N, hist_shapes)
     28     bin_indices_offset = (bin_indices + (N * np.arange(M)[:, None])).ravel()
     29     bc_offset = bincount(bin_indices_offset, weights=weights,
---> 30                             minlength=N*M)
     31     final_shape = (M,) + tuple(hist_shapes)
     32     return bc_offset.reshape(final_shape)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/duck_array_ops.py in f(*args, **kwargs)
     23             else:
     24                 module = eager_module
---> 25             return getattr(module, name)(*args, **kwargs)
     26     else:
     27         def f(data, *args, **kwargs):

<__array_function__ internals> in bincount(*args, **kwargs)

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

I'm not sure what is causing this error.

@rabernat
Copy link
Contributor

Thanks for the bug report @rbavery - this package is very new, so issues like this are very important for ironing out the api and documentation.

Your case is most simple to the tutorial example histogram over a single axis. The only difference is that you have one extra dimension.
I tried to guess what you were trying to do here and refactored your example:

from xhistogram.xarray import histogram 
import numpy as np
import xarray as xr

ny, nx = 100, 100
nt = 44
data_arr = xr.DataArray(np.random.randn(nt,ny,nx),
                        dims=['time', 'y', 'x'],
                        name='blue reflectance')

rmin, rmax, nbins = -4, 4, 50
bin_arr = np.linspace(rmin, rmax, nbins)
histogram(data_arr, bins=[bin_arr], dim=['x','y'])

This works and produces an output that looks like this

<xarray.DataArray 'histogram_blue reflectance' (time: 44, blue reflectance_bin: 49)>
array([[0, 0, 3, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 3, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [0, 1, 3, ..., 0, 1, 1],
       [0, 0, 3, ..., 2, 0, 1]])
Coordinates:
  * blue reflectance_bin  (blue reflectance_bin) float64 -3.918 -3.755 ... 3.918
Dimensions without coordinates: time

In words, this calculates the histogram over the x,y dimensions for each timestep. I assume that is what you wanted to do.

There were a few things wrong with your original example:

  • histogram(*data_arr, ... should be histogram(data_arr, ..., no need to use the *args syntax here. None of the examples uses that, so I'm not sure where you got it. Perhaps the doc string? Let me know how we can make the documentation more clear on this point.
  • There is no need to do bin_arrs = list(bin_arr * np.ones((time_len,1))). Since you only have one input (data-arr), then your list of bins just needs one entry. Let me know how we can make the documentation more clear on this point.

In summary, the usage of xhistogram in this scenario is simpler than you were assuming.(I hope I have not misunderstood your goals.)

Perhaps it would be worth adding this example to the tutorial, as it seems like a very common use case (histogram over image timeseries).

@rbavery
Copy link
Author

rbavery commented Aug 13, 2019

Thanks a bunch for the clarification, I misunderstood the docstring.

For the bins argument, this is the comment if Dask should be used

If a list arrays, the bin edges for each argument in args (required format for Dask inputs).

I took that to mean a list of arrays is needed, with each array corresponding to a DataArray object, since the docstring also states that it accepts multiple DataArray objects and I was working with Dask arrays.

Parameters
argsxarray.DataArray objects

I think it would help to clarify when multiple DataArray objects and bin arrays should be supplie d as inputs, with documentation examples of each. And I think a time series example would be helpful to include, I can make a PR if you'd like.

My goal is actually not to get the counts of reflectance values that fall within each histogram bin but to return a series of 2D arrays where each pixel is labeled numerically by the bin index of the bin that contains the pixel value.

I use this series of labels to calculate the means of each bin of each blue reflectance image and also the corresponding red reflectance means for each bin for each image. This is why I need the labelled array and can't just calculate the means following the last example in the xhistogram tutorial.

scipy.stats.binned_statistic returns the binnumber, or the bin label for each pixel, which is what my original function in this stack overflow post relies on. I realized that xhistogram doesn't fit my use case a bit late but still appreciate the help and am glad to discover a new useful tool.

@rabernat
Copy link
Contributor

Thanks a lot for the feedback! We will try to update the docs with this in mind.

My goal is actually not to get the counts of reflectance values that fall within each histogram bin but to return a series of 2D arrays where each pixel is labeled numerically by the bin index of the bin that contains the pixel value.

It sound like you want numpy.digitize.

@rbavery
Copy link
Author

rbavery commented Aug 13, 2019

Thanks for the suggestion, this looks to be exactly what I need! I've updated my stack overflow post accordingly, still having trouble applying numpy.digitize with apply_ufunc: https://stackoverflow.com/questions/57419541/how-to-use-apply-ufunc-with-numpy-digitize-for-each-image-along-time-dimension-o

@rbavery rbavery closed this as completed Aug 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants