Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

N dimensional histogram function in pytorch as in numpy.histogramdd #29209

Closed
miranov25 opened this issue Nov 5, 2019 · 16 comments
Closed

N dimensional histogram function in pytorch as in numpy.histogramdd #29209

miranov25 opened this issue Nov 5, 2019 · 16 comments
Labels
function request A request for a new function or the addition of new arguments/modes to an existing function. module: numpy Related to numpy support, and also numpy compatibility of our operators OSS contribution wanted PR from open source contributors welcome to solve this issue. triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@miranov25
Copy link

miranov25 commented Nov 5, 2019

🚀 Feature -N dimensional histogram function in pytorch as in numpy

It would be great to have support for multi-dimensional histograms with similar functionality as in the

numpy.histogramdd

See discussion within https://discuss.pytorch.org/t/n-dimensional-histogram-function-in-pytorch-cpu-gpu/59888

Motivation

We are working on interactive visualization of multi-dimensional data
Usually O(2-7) dimensions (still dense) Currently we are using numpy implementation for multidimensional histogram filling, slicing, summing:

  • creating NDimensional histogram once:
    H, axis = np.histogramdd(inputArray.values, bins=bins, range=hRange)
  • querying interactively - many times
    y = np.sum(H[hSliceLocal], axis=axis)

We would like to use PyTorch (or Numba) to get faster GPU implementation

Pitch

I checked numpy implementation in https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/histograms.py#L935-L1113
At the end algorithm in (step 3) is using 1D histogramming - np.bincount
However, numpy implementation assume non uniform binning - to find bins they use internally np.searchsorted.

Alternatives

In case searchsorted will be difficult to implement (I did not find it in pytorch= only external https://github.com/aliutkus/torchsearchsorted), uniform binning will be sufficient.

Additional context

Current numpy Algorithm:

  1. Calculate bin edges if not provided
  2. Compute the bin number each sample falls into - assuming non uniform binning
    np.searchsorted
  3. Compute the sample indices in the flattened histogram matrix
    np.ravel_multi_index
  4. Compute the number of repetitions in xy and assign it to the flattened histmat.
    np.bincount
  5. Shape into a proper matrix
    hist.reshape(nbin)

cc @mruberry @rgommers

@albanD albanD added feature A request for a proper, new feature. module: operators triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module OSS contribution wanted PR from open source contributors welcome to solve this issue. labels Nov 5, 2019
@leofang
Copy link
Contributor

leofang commented Jul 16, 2020

Hi @miranov25, I was digging some histogramdd related issues and found this. May I ask what's the seven dimensional data you have in mind? Just curious 🙂

@miranov25
Copy link
Author

Hello @leofang
We already have pytorch based implementation of the histogramdd in our github (https://github.com/miranov25/RootInteractive/blob/b54446e09072e90e17f3da72d5244a20c8fdd209/RootInteractive/Tools/Histograms/histogramdd.py). We will try soon to make pull request to Pytorch.
We are waiting for new pytorch with searchsorted (now in pre-release) and than we will proceed.

Concerning ND data in CERN (particle physics) we have huge amount of quite dense data. See example bellow.
Our use case pipeline:`

  • Unbinned data
  • ND histograms
  • ND PDF statistics maps (mean, rms, bin median, custom fits. e.g. Gaussian regression)
  • defining deriving data on the set of maps
    • check invariants
    • define maps with detector/ physics interpretation
  • Analytical fits/physical models/regression + ML
  • (Interctive) Visualization

We have plenty of use cases. Until recently we were working with C++ implementation. Now we are moving implementation to
Python, working on RootInteractive project. For the moment we are using mix of C++ and Python.

Visualization Examples based on the ND histograms

Reconstruction performance parameterization (5-6D)

  • Δp (e.g position, angle, momenta)
  • 1/momenta
  • angle
  • track multiplicity
  • particle type
  • reconstruction setting
    • Detector(s) ON/OF

Detector - distortion map calibration (5-6 dimensions)

  • Δpos
  • 3D position (r,φ,z)
  • 1-3D - Δ current - fourier transform

particle identification - 6D

  • Δ PID
  • primary mulitplicity
  • pile up mutplicity
  • mean pile-up position
  • particle inclination angle
  • 1/PID expected

@miranov25
Copy link
Author

Hello @albanD

My collaborators implemented requested functionality in our RootInteractive repository.
See source code, test and PyTorch/numpy benchmark in:
https://github.com/miranov25/RootInteractive/tree/master/RootInteractive/Tools/Histograms
and presentation in:
https://docs.google.com/presentation/d/1vYPTTsn5Zs7X9OWaewUTlnw9hqJ3XKJNGKiAViHv-cY

In case of interest we will be happy to make a pull request to PyTorch.

Regards
Marian

@albanD albanD added the module: numpy Related to numpy support, and also numpy compatibility of our operators label Aug 10, 2020
@albanD
Copy link
Collaborator

albanD commented Aug 10, 2020

Thanks for the pointers.
cc @mruberry is this function on the list for numpy compatibility?

@mruberry
Copy link
Collaborator

mruberry commented Aug 10, 2020

Not being targeted at the moment, no, @albanD, but we do have torch.searchsorted (https://pytorch.org/docs/master/generated/torch.searchsorted.html?highlight=searchsorted#torch.searchsorted) already.

@miranov25
Copy link
Author

We are aware of new searchsorted, we use it already in our code working with current release (1.6)
(slide 2 - https://docs.google.com/presentation/d/1vYPTTsn5Zs7X9OWaewUTlnw9hqJ3XKJNGKiAViHv-cY/edit#slide=id.g7184ebb3b5_0_0)
As I said happy to open a PR in case of interest.
I believe it could be useful for users dealing with dense multidimensional data (see )

Concerning pytorch/numpy comparison, we observe significant difference in performance depending on the number of cores.
For 10^7 points - in our CPU test with 128 cores - factor ~8 in speed PyTorch/Numpy, factor 3 on 12 cores CPU)
We are preparing new versions of benchmark. For a moment we have summary in tables
(https://github.com/miranov25/RootInteractive/blob/master/RootInteractive/Tools/Histograms/benchmark_histogramdd_pytorch.py)

@mruberry
Copy link
Collaborator

cc @ngimel for the performance

This is really interesting stuff. Give me an opportunity to take a closer look at the provided links.

@mruberry
Copy link
Collaborator

Finally had a chance to take a closer look.

I think we would accept histogram and histogramdd functions analogous to NumPy. These could reuse existing PyTorch functions, but they'd need to be implemented in C++ like the great majority of PyTorch functions so they're available in both the Python and C++ APIs. Starting with histogram, which may be little more than an alias for existing PyTorch functionality, would make a lot of sense. If someone on this issue is interested I can give you a few pointers to help you get started. I also added histogram to the NumPy rollup issue: #38349.

@miranov25
Copy link
Author

Hello @mruberry and @albanD
I discussed with my collaborator - Marian - pl0xz0rz who made Python implementation in RootInteractive miranov25/RootInteractive#50

Please provide us pointers, so he can can start.
We plan to port also makePDFmaps (see #29209 (comment)) - as an extension of histogramdd. Currently we are using numpy based implementation. In case of interest I will ask Marian to implement it directly in C++.

@mruberry
Copy link
Collaborator

Take a look at #42755, which added the quantile operator, and #42799, which implemented a few simple composite ops, and #43400, which added an alias for an existing PyTorch function.

I would start by implementing histogram by

  • seeing if it's an alias for an existing PyTorch function (we have torch.histc which is similar to numpy.histogram, but I don't think they're exact aliases)
  • seeing if you can implement histogram correctly and performantly using existing PyTorch functions, which is what the example PRs do

If a custom kernel is needed then let's identify what kind of kernel we'd like to use and we can discuss implementation strategies. Reviewing the histc implementations for CPU and CUDA might be interesting, too:

void THTensor_(histc)(THTensor *hist, THTensor *tensor, int64_t nbins, scalar_t minvalue, scalar_t maxvalue)

Moving the histc implementation from TH to ATen may also be a good place to start as it would cover registering a function with ATen and validating the implementation.

@pl0xz0rz
Copy link
Contributor

Hello @mruberry

I have implemented an operator for histogram in #44485 with support for non-uniform binning and the density flag which works like in NumPy, but I'm not entirely sure about the interface, as it only outputs one Tensor which is the histogram, while Numpy also outputs the array of bin edges.

With histogramdd there are more questions about the interface.

  1. It seems that I can obtain better performance using only existing PyTorch operators when using a (D, N) Tensor than (N, D) as NumPy does. However, it might be the case that it can change if the algorithm is optimized more.
  2. Bins should be either an int, 1D Tensor of integral type or Tensor[] of the same dtype as self?
  3. How should the range argument work? Should there be a tensor for min and a tensor for max, each of length D?
  4. Density and weights seem to be straightforward, they can be used exactly as in Numpy
  5. As for the output, should it return a (hist, edges) Tuple or only the hist Tensor which is the histogram itself?

@mruberry
Copy link
Collaborator

With histogramdd there are more questions about the interface.

  1. It seems that I can obtain better performance using only existing PyTorch operators when using a (D, N) Tensor than (N, D) as NumPy does. However, it might be the case that it can change if the algorithm is optimized more.

How significant is the performance gap? PyTorch doesn't typically support array_likes, so I would expect us to only support the first (N, D) array option.

  1. Bins should be either an int, 1D Tensor of integral type or Tensor[] of the same dtype as self?

You could start by only supporting a single integer (bins for all dimensions).

  1. How should the range argument work? Should there be a tensor for min and a tensor for max, each of length D?

You could start by putting range in the signature but not supporting values other than None.

  1. Density and weights seem to be straightforward, they can be used exactly as in Numpy

Agreed. You could also not implement this to start.

  1. As for the output, should it return a (hist, edges) Tuple or only the hist Tensor which is the histogram itself?

The return type should match NumPy, so a tuple (H, edges).

@miranov25
Copy link
Author

miranov25 commented Sep 21, 2020

Hello @mruberry and @pl0xz0rz

I have comments related to pull request #44485 in context of this issue #29209

@mruberry mruberry added function request A request for a new function or the addition of new arguments/modes to an existing function. and removed feature A request for a proper, new feature. module: operators (deprecated) labels Oct 8, 2020
@saketh-are
Copy link
Contributor

Closed by #65318

@francois-rozet
Copy link

francois-rozet commented Oct 22, 2021

@saketh-are Does your implementation handle CUDA tensors ? It does not seem to, so I would not close this issue until then.

@francois-rozet
Copy link

torch.histogramdd is available as of PyTorch 1.11 (see https://pytorch.org/docs/1.11/generated/torch.histogramdd.html)

@albanD albanD closed this as completed Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
function request A request for a new function or the addition of new arguments/modes to an existing function. module: numpy Related to numpy support, and also numpy compatibility of our operators OSS contribution wanted PR from open source contributors welcome to solve this issue. triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

7 participants