
improve performance of unique with inverse indices #16145

Closed
wants to merge 3 commits

Conversation

@ngimel (Collaborator) commented Jan 18, 2019

Partial fix for #15804, only for the case without `dim`.
For @jcjohnson's benchmarking script I'm getting the following results on a V100:
Before:

Running with N = 10000, M = 10000
cuda (no inverse): 0.98 ms
cpu (no inverse): 0.96 ms
cuda (with inverse): 1.07 ms
cpu (with inverse): 1.76 ms

Running with N = 10000, M = 100000
cuda (no inverse): 0.76 ms
cpu (no inverse): 1.53 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 3.02 ms

Running with N = 100000, M = 100000
cuda (no inverse): 1.28 ms
cpu (no inverse): 11.22 ms
cuda (with inverse): 69.76 ms
cpu (with inverse): 20.28 ms

Running with N = 100000, M = 1000000
cuda (no inverse): 0.78 ms
cpu (no inverse): 18.78 ms
cuda (with inverse): 133.45 ms
cpu (with inverse): 34.09 ms

Running with N = 500000, M = 500000
cuda (no inverse): 1.43 ms
cpu (no inverse): 61.13 ms
cuda (with inverse): 3315.18 ms
cpu (with inverse): 104.57 ms

Running with N = 500000, M = 5000000
cuda (no inverse): 0.86 ms
cpu (no inverse): 96.44 ms
cuda (with inverse): 5209.93 ms
cpu (with inverse): 176.10 ms

After:

Running with N = 10000, M = 10000
cuda (no inverse): 1.04 ms
cpu (no inverse): 0.94 ms
cuda (with inverse): 0.64 ms
cpu (with inverse): 1.76 ms

Running with N = 10000, M = 100000
cuda (no inverse): 0.77 ms
cpu (no inverse): 1.55 ms
cuda (with inverse): 0.58 ms
cpu (with inverse): 2.79 ms

Running with N = 100000, M = 100000
cuda (no inverse): 1.30 ms
cpu (no inverse): 14.15 ms
cuda (with inverse): 1.63 ms
cpu (with inverse): 20.90 ms

Running with N = 100000, M = 1000000
cuda (no inverse): 0.82 ms
cpu (no inverse): 18.63 ms
cuda (with inverse): 0.61 ms
cpu (with inverse): 33.52 ms

Running with N = 500000, M = 500000
cuda (no inverse): 1.51 ms
cpu (no inverse): 59.81 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 110.69 ms

Running with N = 500000, M = 5000000
cuda (no inverse): 0.92 ms
cpu (no inverse): 104.26 ms
cuda (with inverse): 0.84 ms
cpu (with inverse): 187.12 ms
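
For context, a minimal sketch of the kind of timing loop such numbers come from (this is an illustrative reconstruction, not @jcjohnson's actual script; it assumes a CUDA device and uses plain wall-clock timing with explicit synchronization):

```
import time
import torch

def bench_ms(fn, iters=20):
    # Crude wall-clock timing in milliseconds; synchronize so queued CUDA work
    # is actually included in the measurement.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000.0

for N, M in [(10000, 10000), (100000, 100000), (500000, 500000)]:
    a_cpu = torch.randint(M, size=(N,))
    a_gpu = a_cpu.cuda()
    print('Running with N = %d, M = %d' % (N, M))
    print('cuda (no inverse): %.2f ms' % bench_ms(lambda: a_gpu.unique()))
    print('cuda (with inverse): %.2f ms' % bench_ms(lambda: a_gpu.unique(return_inverse=True)))
```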

@soumith (Member) commented Jan 18, 2019

Before I land this, could you please locally check correctness as well, against CPU?

The unit tests only cover small_3d test cases.

@soumith added this to the 1.0.1 milestone Jan 18, 2019
@jcjohnson (Contributor) commented:

This is amazing -- thank you so much @ngimel for the quick fix!

@ngimel (Collaborator, Author) commented Jan 18, 2019

@soumith tested on Justin's benchmark cases, all clean. Do you want some tests added to the test suite?

import torch

for N in [10000, 100000, 500000]:
    for M in [N, 10 * N]:
        print('Running with N = %d, M = %d' % (N, M))
        a = torch.randint(M, size=(N,))
        # CPU reference result
        unique, indices = a.unique(return_inverse=True)
        # same computation on the GPU
        a = a.cuda()
        unique_gpu, indices_gpu = a.unique(return_inverse=True)
        # move the CPU results over and compare elementwise
        unique = unique.cuda()
        indices = indices.cuda()
        print((unique - unique_gpu).abs().max(), (indices - indices_gpu).abs().max())
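
For what it's worth, the same check can also be phrased as hard assertions so a mismatch fails loudly instead of relying on reading the printed maxima. This is only a sketch (the helper name is mine, not part of the PR), and it relies on both code paths returning the unique values sorted:

```
import torch

def check_unique_matches_cpu(N, M):
    # Hypothetical helper: verify CPU and CUDA unique/inverse agree exactly.
    a = torch.randint(M, size=(N,))
    unique_cpu, inverse_cpu = a.unique(return_inverse=True)
    unique_gpu, inverse_gpu = a.cuda().unique(return_inverse=True)
    assert torch.equal(unique_cpu.cuda(), unique_gpu)
    assert torch.equal(inverse_cpu.cuda(), inverse_gpu)

for N in [10000, 100000, 500000]:
    for M in [N, 10 * N]:
        check_unique_matches_cpu(N, M)
```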

@soumith (Member) commented Jan 18, 2019

@ngimel your testing sounds good, and there are smoke tests with small_3d matrices in test_torch / test_cuda which are hopefully sufficient. If you see anything worth adding as a unit test, do send a follow-up PR; I'll land this now.

@facebook-github-bot (Contributor) left a comment


@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jan 18, 2019
Summary:
Partial fix for #15804, only for the case without dim. Benchmark results on a V100 for jcjohnson's benchmarking script are identical to those in the PR description above.
Pull Request resolved: pytorch/pytorch#16145

Differential Revision: D13738821

Pulled By: soumith

fbshipit-source-id: 0811fb4ade47e3b466cebbc124e3f3333a986749
@gkioxari commented Jan 23, 2019

@ngimel @jcjohnson Is it possible to have unique_index returned as well, when the appropriate flag is turned on? This would be similar functionality to np.unique().

@ngimel (Collaborator, Author) commented Jan 23, 2019

Yes, it's possible: my original commit returned unique_index instead of inverse_index, so one would just restore the thrust::unique_by_key call with appropriate handling of the options. 883edac
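
Until such a flag exists, one possible workaround is to derive the first-occurrence indices, in the spirit of np.unique(..., return_index=True), from the inverse indices that unique already returns. This is an illustrative sketch only, not this PR's implementation; the helper name and the use of a stable sort are my own choices:

```
import torch

def unique_with_first_index(x):
    # Emulate np.unique(x, return_index=True) for a 1-D tensor by post-processing
    # the inverse indices that torch.unique already provides.
    unique, inverse = x.unique(sorted=True, return_inverse=True)
    # A stable sort groups equal inverse values together while preserving the
    # original order within each group, so the first element of each group is
    # the first occurrence of that unique value in x.
    sorted_inverse, order = torch.sort(inverse, stable=True)
    is_first = torch.ones_like(sorted_inverse, dtype=torch.bool)
    is_first[1:] = sorted_inverse[1:] != sorted_inverse[:-1]
    return unique, order[is_first]

# Example: x = torch.tensor([3, 1, 3, 2]) -> unique = [1, 2, 3], first_index = [1, 3, 0]
```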

@ngimel deleted the unique branch January 23, 2019 23:11
soumith pushed a commit that referenced this pull request Jan 24, 2019
@soumith added the "cherry-picked" label (This PR was cherry-picked onto a release branch from master) Jan 24, 2019
soumith pushed a commit that referenced this pull request Jan 29, 2019
theweiho added a commit to theweiho/translate that referenced this pull request May 16, 2019
Summary:
pytorch/pytorch#8899 had added CUDA support for `torch.unique()`

pytorch/pytorch#16145 has some timing stats that could be relevant

 ---

Experiment results: https://fb.quip.com/olQOA853j0mb
Words per second (`gpu-unique_wps_avg_vs_base`): 1.046x
Total train time (`gpu-unique_total_train_time_vs_base`; excl ar_AR-fr_XX): 0.987x

Even though the training-time reduction is pretty minimal (probably overshadowed by random variance, scheduling delay, etc.), WPS does seem to be ~5% faster, so we might as well land this.

Training time for ar_AR-fr_XX increased significantly, but that's because it trained for many more updates (`gpu-unique_num_updates_avg_vs_base`) and also ended up with +1.43 BLEU. I think this is probably just an anomaly?

Differential Revision: D15073468

fbshipit-source-id: a9710738c827013afb35a67bd3a9be259b0e2d5f
theweiho added a commit to theweiho/translate that referenced this pull request May 16, 2019
theweiho added a commit to theweiho/translate that referenced this pull request May 16, 2019
facebook-github-bot pushed a commit to pytorch/translate that referenced this pull request May 17, 2019
Summary:
Pull Request resolved: #537

Reviewed By: akinh, jmp84

Differential Revision: D15073468

fbshipit-source-id: c2dba562b6d4fb4d15d2a56d03ce6a6e3ddff07d