Description
Originally raised in a comment on #50937
The COO to CSR tensor conversion is implemented in Lines 928 to 940 in 3a777b6:

```python
if self.is_sparse:
    coalesced_self = self.coalesce()
    row_indices = coalesced_self.indices()[0]
    ro = [0]
    i = 0
    for irow in range(self.shape[0]):
        while i < row_indices.size()[0] and row_indices[i] == irow:
            i += 1
        ro.append(i)
    return torch.sparse_csr_tensor(torch.tensor(ro, dtype=row_indices.dtype),
                                   coalesced_self.indices()[1], coalesced_self.values(),
                                   size=coalesced_self.shape, dtype=coalesced_self.dtype)
```
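To make the compression step concrete, here is a minimal pure-Python rendition of the loop above (the function name is hypothetical, and tensors are replaced by plain lists), showing the compressed row offsets it produces for a tiny example:

```python
def compress_rows(row_indices, nrows):
    # Mirror of the loop in to_sparse_csr: for each row, advance a cursor
    # past all nonzeros belonging to that row and record the cursor position.
    ro = [0]
    i = 0
    for irow in range(nrows):
        while i < len(row_indices) and row_indices[i] == irow:
            i += 1
        ro.append(i)
    return ro

# Sorted (coalesced) row indices of 4 nonzeros in a 4-row matrix:
# two nonzeros in row 0, one in row 1, one in row 3 (row 2 is empty).
print(compress_rows([0, 0, 1, 3], 4))  # [0, 2, 3, 3, 4]
```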
It consists of two time-consuming parts:
- coalescing, if the COO tensor is uncoalesced
- the row indices compression procedure
Indeed, taking a coalesced COO sample with size=(1000, 1000) and nnz=500, about 1.5% of the `to_sparse_csr` call time is spent in coalescing and 98% of the time in running the row indices compression loop.
The slowness of the row indices compression loop has two origins:
- it is implemented in Python
- it uses a suboptimal algorithm
Indeed, for a sample of row indices with size[0] == 1000 and nnz=500, the pure Python algorithm in
https://github.com/pearu/gcs/blob/b54ba0cba9c853b797274ef26b6c42386f2cafa3/gcs/storage.py#L24-L45
is about 3x faster than the one used in `to_sparse_csr`, and when applying `numba.njit` to the `compress_indices` function, about an 83x speedup is achieved compared to `to_sparse_csr`.
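For reference, a counting-based compression in the spirit of the linked `compress_indices` (this is an illustrative sketch, not the exact code from that repository) replaces the per-row scan with a single pass over the nonzeros followed by a prefix sum:

```python
def compress_indices(row_indices, nrows):
    # One pass over the nonzeros: count how many fall in each row.
    counts = [0] * (nrows + 1)
    for r in row_indices:
        counts[r + 1] += 1
    # In-place prefix sum turns per-row counts into CSR row offsets.
    for i in range(nrows):
        counts[i + 1] += counts[i]
    return counts

print(compress_indices([0, 0, 1, 3], 4))  # [0, 2, 3, 3, 4]
```

Unlike the loop in `to_sparse_csr`, this shape is also trivial to JIT-compile or port to C++, since it involves no tensor indexing inside the hot loop.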
The performance of COO to CSR tensor conversion can be improved by implementing a more efficient row indices compression algorithm (see the `compress_indices` function referenced above) in C++.
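As a sketch of what such an implementation could do, the counting-plus-prefix-sum scheme also vectorizes cleanly; for example with NumPy (the function name here is illustrative):

```python
import numpy as np

def compress_indices_vec(row_indices, nrows):
    # Histogram of nonzeros per row, then an exclusive prefix sum,
    # yields the CSR row offsets in O(nnz + nrows) without a Python loop.
    counts = np.bincount(row_indices, minlength=nrows)
    crow = np.zeros(nrows + 1, dtype=np.int64)
    np.cumsum(counts, out=crow[1:])
    return crow

print(compress_indices_vec(np.array([0, 0, 1, 3]), 4))  # [0 2 3 3 4]
```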
cc @VitalyFedyunin @ngimel @aocsa @nikitaved @pearu @mruberry