
Performance degradation with large lookup tables - optimizer._apply_sparse_duplicate_indices (TF V1.0.1) #10270

Closed
KashiErez opened this issue May 29, 2017 · 18 comments
Assignees
Labels
stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author type:performance Performance Issue

Comments

@KashiErez

KashiErez commented May 29, 2017

Hi,

I ran into this performance issue while trying to upgrade tensorflow from version 0.12.1 to 1.X.

We ran a network with large embedding lookup tables:

  • 100K X 32 (for example, word embedding - with 100K unique words)
  • 300K X 128 (for example, categorical feature with cardinality of 300K unique items)

After upgrading TF to version 1.0.1, GPU usage dropped from 60% to 30%.
Training time went up by 50%-200% (depending on how big the embedding lookup table is).

This is the commit that caused the performance degradation:
f9f56f9

The handling of unique indices is very slow and does not run in parallel with other operations.
Please note the big unique blocks in the middle of the trace.
[screenshot: trace_unique]

Here is a workaround (skipping the unique-index handling):

import tensorflow as tf

class MyOptimizer(tf.train.AdamOptimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                 use_locking=False, name="Adam"):
        super(MyOptimizer, self).__init__(learning_rate, beta1, beta2, epsilon, use_locking, name)

    def _apply_sparse_duplicate_indices(self, grad, var):
        # Skip the deduplication (unique + unsorted_segment_sum) and apply the
        # sparse update directly, as the pre-1.0 behavior did.
        return self._apply_sparse(grad, var)
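
For completeness, a hypothetical usage sketch; MyOptimizer is a drop-in replacement for tf.train.AdamOptimizer ('loss' here is a placeholder for your model's loss tensor, not something from the original graph):

# Hypothetical usage sketch - 'loss' is assumed to be the model's loss tensor.
optimizer = MyOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss)  # sparse embedding gradients now skip the unique op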

Thanks,
Erez

@KashiErez
Author

KashiErez commented May 29, 2017

Forgot to mention - this is the GPU info:

$nvidia-smi
Mon May 29 08:50:48 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 0000:03:00.0     Off |                    0 |
| N/A   23C    P8    16W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      Off  | 0000:82:00.0     Off |                    0 |
| N/A   21C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@KashiErez
Author

I am using this Docker base image:

nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04

nvidia-docker --version
Docker version 1.12.5, build 7392c3b

@KashiErez
Author

KashiErez commented May 29, 2017

This is what a good trace looks like (after downgrading to version 0.12.1 or applying the workaround):
[screenshot: good_trace_no_unique]

Note that there is higher parallelism.
No big 'unique block' in the middle.

@allenlavoie
Member

Interesting timelines, and sorry you're running into this. Some/most of this time is likely copying to host memory: we don't actually have a GPU kernel for unique, but one needed to be registered to avoid interfering with op placements. The computation happens on the CPU.

So this could be sped up by implementing a real GPU kernel for Unique. That's likely preferable to fusing Adam's sparse updates into a GPU kernel, although it is another possibility.

So any interest in writing a GPU kernel for unique? I don't know of anyone who is working on one right now.

@allenlavoie
Member

@zhangyaobit @yzhwang since we discussed this.

@KashiErez: One useful clarification would be how many indices are going into the UniqueOp, which should be equal to the number of embeddings that are accessed in each iteration (e.g. sentence length).

@allenlavoie
Member

Also @ekelsen, who has been thinking about using CUB as a way to implement these kinds of ops on the GPU.

@ekelsen
Contributor

ekelsen commented Jun 1, 2017

Yeah, hopefully CUB will be usable from TF soon. In that case, unique can be done by sorting and then doing run-length-encoding.
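
For illustration, a minimal NumPy sketch of that sort + run-length-encode approach (an actual GPU kernel would use CUB primitives such as cub::DeviceRadixSort and cub::DeviceRunLengthEncode; note that, unlike tf.unique, a sort-based unique returns values in sorted order rather than first-occurrence order):

import numpy as np

def unique_via_sort_rle(indices):
    # 1) Sort the indices (on the GPU this would be a radix sort).
    order = np.argsort(indices, kind="stable")
    sorted_idx = indices[order]
    # 2) Run-length encode: a new run starts wherever the sorted value changes.
    run_start = np.concatenate(([True], sorted_idx[1:] != sorted_idx[:-1]))
    unique_values = sorted_idx[run_start]
    # Map each original position back to its run, matching tf.unique's second output.
    run_ids = np.cumsum(run_start) - 1
    positions = np.empty(len(indices), dtype=np.int64)
    positions[order] = run_ids
    return unique_values, positions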

@KashiErez
Author

KashiErez commented Jun 2, 2017

Regarding the question: "One useful clarification would be how many indices are going into the UniqueOp, which should be equal to the number of embeddings that are accessed in each iteration (e.g. sentence length)."

Answer:

The batch size is 1024.

Regarding words:
The sentence length is 21.
Number of unique words ~ 100K

But we have another categorical feature that has only one value in each iteration,
and its number of unique values is ~ 300K.

From the trace you can see that this categorical feature's 'unique' op runs much slower than the word embedding's 'unique' op.

So I think that the parameter you should look at first is 'number of unique values' (== lookup table size).

@allenlavoie
Member

In that case, could you add a print node to get the exact shape of the Tensor going into unique?

The whole idea of this code path is that the gradients are sparse; the IndexedSlices from the gradient of the embedding lookup has a number of indices equal to the number of embeddings which were actually accessed (which it sounds like should be ~21504 and ~1024?), independent of the size of the embedding lookup table.
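
For context, a minimal TF 1.x sketch of why the gradient's index count tracks the number of lookups rather than the table size (the table/ids names below are illustrative, not taken from the model in this thread):

import tensorflow as tf

table = tf.Variable(tf.random_normal([100000, 32]))  # 100K x 32 embedding table
ids = tf.placeholder(tf.int64, shape=[None])         # e.g. 1024 * 21 word ids per batch
loss = tf.reduce_sum(tf.nn.embedding_lookup(table, ids))

grad = tf.gradients(loss, table)[0]  # an IndexedSlices, not a dense Tensor
# grad.indices has one entry per id actually looked up, independent of the
# 100K rows in the table; these are the indices that feed the unique op.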

@KashiErez
Author

Hi,

Not sure I understand where to add the print node.
'_apply_sparse_duplicate_indices' gets two parameters: grad, var
Should I print these tensors' shapes?

@KashiErez
Author

KashiErez commented Jun 5, 2017

Hi,

This is the optimizer I used to print the indices shape:
(I copied the code from optimizer.Optimizer class, and injected a tf.Print statement).

import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops


def _my_deduplicate_indexed_slices(values, indices, grad):
    """Sums `values` associated with any non-unique `indices`.
    Args:
      values: A `Tensor` with rank >= 1.
      indices: A one-dimensional integer `Tensor`, indexing into the first
        dimension of `values` (as in an IndexedSlices object).
      grad: The original `IndexedSlices` gradient; used here only so its name
        can be included in the printed message.
    Returns:
      A tuple of (`summed_values`, `unique_indices`) where `unique_indices` is a
      de-duplicated version of `indices` and `summed_values` contains the sum of
      `values` slices associated with each unique index.
    """

    # Print the shape of the indices feeding the unique op (only for the first run).
    indices = tf.Print(indices,
                       data=[grad.name, tf.shape(indices)],
                       message='$$$$$$$$ indices.shape   ----- ',
                       first_n=1)

    unique_indices, new_index_positions = array_ops.unique(indices)
    summed_values = math_ops.unsorted_segment_sum(
        values, new_index_positions,
        array_ops.shape(unique_indices)[0])
    return (summed_values, unique_indices)


class MyAdamOptimizer(tf.train.AdamOptimizer):
    """
    This Optimizer is a workaround for this issue:
    https://github.com/tensorflow/tensorflow/issues/10270
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                 use_locking=False, name="MyAdam"):
        super(MyAdamOptimizer, self).__init__(learning_rate, beta1, beta2, epsilon, use_locking, name)

    def _apply_sparse_duplicate_indices(self, grad, var):
        """
        Overrides the base method only to inject a tf.Print on the gradient
        indices; the 'unique indices' deduplication logic itself is unchanged.
        It is this logic that produces the performance degradation with large
        lookup tables.
        """
        summed_values, unique_indices = _my_deduplicate_indexed_slices(
            values=grad.values, indices=grad.indices, grad=grad)

        gradient_no_duplicate_indices = ops.IndexedSlices(
            indices=unique_indices,
            values=summed_values,
            dense_shape=grad.dense_shape)
        return self._apply_sparse(gradient_no_duplicate_indices, var)

I ran it on 3 models, here are the results:

Model 1 - a few small categorical features (no more than 10K each):

2017-06-05 12:49:30.233421: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_8:0][10266]
2017-06-05 12:49:30.234021: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_4:0][1058]
2017-06-05 12:49:30.233421: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat:0][8687]
2017-06-05 12:49:30.233421: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_6:0][1033]
2017-06-05 12:49:30.237197: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_2:0][1036]

Model 2 - small categorical features (no more than 10K each) + words (50K unique values):

2017-06-05 11:39:45.333000: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_8:0][10266]
2017-06-05 11:39:45.332994: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_4:0][1058]
2017-06-05 11:39:45.332984: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_2:0][1036]
2017-06-05 11:39:45.333465: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat:0][8687]
2017-06-05 11:39:45.333727: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_6:0][1033]
2017-06-05 11:39:45.333756: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_10:0][70681]

Model 3 - small categorical features (no more than 10K each) + words (50K unique values) + big categorical feature (300K unique values):

2017-06-05 12:41:13.841337: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_6:0][1033]
2017-06-05 12:41:13.842412: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_4:0][1057]
2017-06-05 12:41:13.840662: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_8:0][9131]
2017-06-05 12:41:13.841332: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_12:0][40896]
2017-06-05 12:41:13.841334: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat:0][8632]
2017-06-05 12:41:13.841974: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_10:0][315805]
2017-06-05 12:41:13.844600: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_2:0][1036]

Summing up indices sizes:
Model 1 - 22080 (small categorical features)
Model 2 - 92761 (small categorical features + 50k feature)
Model 3 - 377590 (small categorical features + 50k feature + 300K feature)

In summary, feature cardinality and total indices size are correlated.

A small note regarding the gradient scopes (my_optimizer_0/gradients/concat*):

my_optimizer_0 - this is my code.
gradients/concat* - I think it comes from here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/gradients_impl.py

@allenlavoie
Member

Well, that explains why the transfer times are in milliseconds: 3 megabytes of indices transferred at ~1 gigabyte/sec, which then take a round trip (even if they're all duplicates, the second result of unique() has the same size as its input).

There's a separate question of why that many embeddings are accessed at each iteration (all of them?). It's definitely unusual for a language model. Regardless, clearly it would be nice to have a GPU kernel for unique.
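
A rough back-of-the-envelope check of those numbers (assuming int64 indices and the ~1 GB/s effective transfer rate mentioned above):

num_indices = 377590               # model 3, summed over all embedding gradients
bytes_total = num_indices * 8      # int64 indices -> ~3.0 MB
seconds_one_way = bytes_total / 1e9
# ~3 ms per direction, and unique's second output is the same size on the return trip.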

@ningyuwhut

I ran into this problem too; has it been solved?

@allenlavoie
Member

@ningyuwhut I'd suggest using the workaround @KashiErez mentioned for now if you don't care about deduplicating sparse gradients and want the previous behavior. This was a bug fix, so we can't just go back to the old behavior by default.

AFAIK nobody is working on a GPU kernel for UniqueOp, but that still seems like the resolution here if you're interested in taking the bug.

@rmothukuru rmothukuru added the type:performance Performance Issue label May 24, 2021
@mohantym
Contributor

mohantym commented Feb 4, 2022

Hi @KashiErez! Sorry for the late response.
You are using an older version (1.x) of TensorFlow, which is no longer supported. Could you check with the latest versions (TF 2.7/2.8)? A lot of bug fixes have been made and features have been added in the latest versions. Thanks!

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Feb 4, 2022
@mohantym mohantym self-assigned this Feb 4, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Feb 11, 2022
@skyson

skyson commented Feb 11, 2022 via email

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.
