torch.blkdiag [A way to create a block-diagonal matrix] #31932
Comments
If you think it is necessary but don't have time to implement it, I can finish the relevant code (maybe a CUDA kernel for GPU and a CPU version).
I'm also looking for a solution for a blkdiag method. However, it's important that this creates a sparse tensor; otherwise the time complexity will be very high (quadratic in batch size). A dense blkdiag method can't be used in graph neural networks, since it would be slower than operating on the graphs one by one.
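To make the complexity point concrete, here is a minimal sketch (not code from this thread; numpy stands in for torch, and the function name is illustrative) of how a sparse block-diagonal matrix can be assembled in time linear in the total number of nonzeros, by simply offsetting each block's COO indices:

```python
import numpy as np

def sparse_block_diag_coo(blocks):
    """Assemble COO (rows, cols, vals, shape) for a block-diagonal matrix.

    `blocks` is a list of (rows, cols, vals, (m, n)) COO descriptions.
    Each block's indices are shifted by the running row/col offsets, so
    the cost is linear in the total number of nonzeros, not quadratic
    in the number of blocks.
    """
    all_rows, all_cols, all_vals = [], [], []
    row_off = col_off = 0
    for rows, cols, vals, (m, n) in blocks:
        all_rows.append(np.asarray(rows) + row_off)
        all_cols.append(np.asarray(cols) + col_off)
        all_vals.append(np.asarray(vals))
        row_off += m
        col_off += n
    return (np.concatenate(all_rows), np.concatenate(all_cols),
            np.concatenate(all_vals), (row_off, col_off))

# Two 2x2 identity blocks merge into a 4x4 block-diagonal identity.
rows, cols, vals, shape = sparse_block_diag_coo([
    ([0, 1], [0, 1], [1.0, 1.0], (2, 2)),
    ([0, 1], [0, 1], [1.0, 1.0], (2, 2)),
])
```

The same index-offset idea carries over directly to building a torch sparse COO tensor.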
You said it; then we'd better implement both a sparse matrix version and a normal (dense) version.
I'm not sure what our procedure for taking new operators is (I've marked this as triage review so that the team can discuss). However, this sounds like a perfectly reasonable feature request to me.
Given this is in scipy (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.linalg.block_diag.html) and it creates a dense matrix, we should just do it.
You said it. In my project, my current implementation is:
However, if we want to use it as the base API, we should refactor this part of the code with C and CUDA. Are you doing this? If not, I will start a PR.
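The typical pure-Python workaround discussed in this thread (the author's actual code was not preserved here; this is a hedged sketch using numpy in place of torch) zero-fills the output and copies each block into place:

```python
import numpy as np

def block_diag(*arrs):
    """Dense block-diagonal: zero-fill the output, then copy each
    block into its diagonal position. Mirrors scipy.linalg.block_diag."""
    arrs = [np.atleast_2d(a) for a in arrs]
    rows = sum(a.shape[0] for a in arrs)
    cols = sum(a.shape[1] for a in arrs)
    out = np.zeros((rows, cols), dtype=arrs[0].dtype if arrs else float)
    r = c = 0
    for a in arrs:
        m, n = a.shape
        out[r:r + m, c:c + n] = a
        r += m
        c += n
    return out
```

A native C++/CUDA version would follow the same copy loop, just without per-call Python overhead.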
By the way, I do agree @ThyrixYang has a point. We could implement a sparse version as well.
I don't think we should complicate this issue by bringing up sparse blkdiag -- there is already a separate issue for that: #31942.
I can work on this. |
OK, many thanks, and good luck. Don't forget to point your PR to this issue. I'll take a look if you need help.
I notice that scipy's block_diag takes a variable-length argument list rather than a list of arrays. I don't know whether that's possible for a PyTorch operator; if not, I'll just use a TensorList. If we do choose to use TensorList, then it's worth noting that the scipy function allows you to supply zero arguments, but TensorList does not allow you to supply an empty list.
The above commit adds a CPU implementation of block_diag. As mentioned in my previous comment, I used a TensorList for its argument, rather than a variable-length argument list, because I don't know how (or whether it's possible) to do that in PyTorch. I can change it, if anyone wants me to and can tell me how. Next I'm going to look into how performant my implementation is, comparing against the performance of tczhangzhi's workaround from this comment: #31932 (comment)
@kurtamohler There's an ambiguity in the zero-argument case.
There are some functions in our API that take varargs.
Oh, I see. Then it might not be worth it to make the empty input work, right? Does anyone in this thread have a need for it?
I've added a check to make sure all the input tensors have the same scalar type, and throw an error if not. Is this alright with everyone? Alternatively, I could just convert every tensor to match the first one in the argument list.
The CPU performance of my implementation is at least as good as tczhangzhi's workaround for the input sizes and data types I've tried so far. Here's my performance measurement script:
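The author's actual measurement script was not preserved in this thread; a minimal harness for this kind of comparison (names are illustrative) can be built on the stdlib timeit module, taking the best of several repeats to reduce noise:

```python
import timeit

def bench(fn, *args, repeat=5, number=100):
    """Best-of-`repeat` average seconds per call of fn(*args)."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

def speedup(baseline, candidate, *args, **kw):
    """How many times faster `candidate` is than `baseline` on the same input."""
    return bench(baseline, *args, **kw) / bench(candidate, *args, **kw)

# Example: closed-form sum vs. Python loop on the same input.
s = speedup(lambda n: sum(range(n)), lambda n: n * (n - 1) // 2, 10_000)
```

Running the two block_diag variants through such a harness over a grid of block counts and block sizes gives the speedup table below.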
And here are my measurements:
The speedups are all near 1 or greater. As the input sizes get larger, the speedup seems to approach 1, which makes sense: with larger inputs, the overhead of the Python calls in the workaround becomes less significant, and my implementation is algorithmically similar to the workaround. I wouldn't be surprised if there's a way to speed up block_diag further, but I won't focus on that for now. I'll start implementing the CUDA version.
If I use CUDA tensors as the input to my existing block_diag implementation, I get pretty good parallel scaling when the number of input matrices is fairly small. But as the number of matrices increases, the CUDA speedup decreases. This makes sense because I'm using a serial for loop over all the matrices. Here's the performance comparison:
I wonder if I can get away with not creating a specialized CUDA implementation. I've seen a function that might make that possible.
After thinking a bit more, I guess there might not be a good way to further parallelize the CUDA performance of block_diag with a specialized implementation, since each tensor in the TensorList is allowed to be a different size. So for now, I'll stop thinking about this unless someone can recommend a good CUDA parallelization strategy. I'll start writing a test now, and once that's done I'll start a PR.
Nice job! I'm very satisfied with the CUDA results. I'm quite busy these days, but I'll look into your code later.
Also add API documentation. Issue: pytorch#31932
I created a new issue #36500 to implement sparse support for block_diag. I can work on it if the Facebook team decides to mark it high priority.
Oh whoops, I didn't realize that #31942 already existed for sparse support. Closing the new ticket.
🚀 Feature
A way to create a block-diagonal matrix:
Motivation
Graph deep learning is getting more and more attention. A commonly used acceleration technique is to merge the adjacency matrices of subgraphs into one large block-diagonal adjacency matrix, so we need a fast blkdiag operation. The name comes from the blkdiag function in MATLAB: https://www.mathworks.com/help/matlab/ref/blkdiag.html

The implementation of this method has been discussed in the community:
https://discuss.pytorch.org/t/creating-a-block-diagonal-matrix/17357
https://discuss.pytorch.org/t/creating-a-block-diagonal-matrix/22592
https://stackoverflow.com/questions/54856333/pytorch-diagonal-matrix-block-set-efficiently/56638727#56638727
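The graph-batching motivation above can be sketched concretely (a hedged numpy illustration, not code from this issue): merging two subgraph adjacency matrices produces one disconnected graph whose adjacency is block-diagonal, with each subgraph's node ids shifted by a running offset.

```python
import numpy as np

# Two toy subgraph adjacency matrices: a 2-node path and a triangle.
A1 = np.array([[0, 1],
               [1, 0]])
A2 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]])

# Batching merges them into one disconnected graph: a block-diagonal
# adjacency matrix, with A2's node ids shifted past A1's.
n1, n2 = A1.shape[0], A2.shape[0]
A = np.zeros((n1 + n2, n1 + n2), dtype=A1.dtype)
A[:n1, :n1] = A1
A[n1:, n1:] = A2
```

One GPU kernel can then process the whole batch at once instead of looping over subgraphs.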
Pitch
Alternatives
#31942 discussed the sparse version of this method, which I think is also necessary.
cc @ezyang @gchanan @zou3519