
Performance degradation with large lookup tables - optimizer._apply_sparse_duplicate_indices (TF V1.0.1) #10270

Closed
KashiErez opened this issue May 29, 2017 · 18 comments
Assignees
Labels
stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author type:performance Performance Issue

Comments

@KashiErez

KashiErez commented May 29, 2017

Hi,

I ran into this performance issue while trying to upgrade tensorflow from version 0.12.1 to 1.X.

We ran a network with large embedding lookup tables:

  • 100K X 32 (for example, word embedding - with 100K unique words)
  • 300K X 128 (for example, categorical feature with cardinality of 300K unique items)

After upgrading TF to version 1.0.1, GPU usage dropped from 60% to 30%.
Training time went up by 50%-200% (depending on how big the embedding lookup table is).

This is the commit that caused the performance degradation:
f9f56f9

The handling of unique indices is very slow and does not run in parallel with other operations.
Please note the big unique blocks in the middle of the trace.
[screenshot: trace_unique]

Here is a workaround (skipping the unique-index handling):

import tensorflow as tf

class MyOptimizer(tf.train.AdamOptimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                 use_locking=False, name="Adam"):
        super(MyOptimizer, self).__init__(learning_rate, beta1, beta2, epsilon, use_locking, name)

    def _apply_sparse_duplicate_indices(self, grad, var):
        # Skip the deduplication (unique + unsorted_segment_sum) and apply the
        # sparse update directly, as the pre-1.0 behavior did.
        return self._apply_sparse(grad, var)
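
For completeness, a hypothetical usage sketch; MyOptimizer is a drop-in replacement for tf.train.AdamOptimizer ('loss' here is a placeholder for your model's loss tensor, not something from the original graph):

# Hypothetical usage sketch - 'loss' is assumed to be the model's loss tensor.
optimizer = MyOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss)  # sparse embedding gradients now skip the unique op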

Thanks,
Erez

@KashiErez
Author

KashiErez commented May 29, 2017

Forgot to mention - this is the GPU info:

$nvidia-smi
Mon May 29 08:50:48 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 0000:03:00.0     Off |                    0 |
| N/A   23C    P8    16W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      Off  | 0000:82:00.0     Off |                    0 |
| N/A   21C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@KashiErez
Author

I am using this Docker base image:

nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04

nvidia-docker --version
Docker version 1.12.5, build 7392c3b

@KashiErez
Author

KashiErez commented May 29, 2017

This is what a good trace looks like (after downgrading to version 0.12.1 or applying the workaround):
[screenshot: good_trace_no_unique]

Note that there is higher parallelism.
No big 'unique block' in the middle.

@allenlavoie
Member

Interesting timelines, and sorry you're running into this. Some/most of this time is likely copying to host memory: we don't actually have a GPU kernel for unique, but one needed to be registered to avoid interfering with op placements. The computation happens on the CPU.

So this could be sped up by implementing a real GPU kernel for Unique. That's likely preferable to fusing Adam's sparse updates into a GPU kernel, although it is another possibility.

So any interest in writing a GPU kernel for unique? I don't know of anyone who is working on one right now.

@allenlavoie
Member

@zhangyaobit @yzhwang since we discussed this.

@KashiErez: One useful clarification would be how many indices are going into the UniqueOp, which should be equal to the number of embeddings that are accessed in each iteration (e.g. sentence length).

@allenlavoie
Member

Also @ekelsen, who has been thinking about using CUB as a way to implement these kinds of ops on the GPU.

@ekelsen
Contributor

ekelsen commented Jun 1, 2017

Yeah, hopefully CUB will be usable from TF soon. In that case, unique can be done by sorting and then doing run-length-encoding.
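
For illustration, a minimal NumPy sketch of that sort + run-length-encode approach (an actual GPU kernel would use CUB primitives such as cub::DeviceRadixSort and cub::DeviceRunLengthEncode; note that, unlike tf.unique, a sort-based unique returns values in sorted order rather than first-occurrence order):

import numpy as np

def unique_via_sort_rle(indices):
    # 1) Sort the indices (on the GPU this would be a radix sort).
    order = np.argsort(indices, kind="stable")
    sorted_idx = indices[order]
    # 2) Run-length encode: a new run starts wherever the sorted value changes.
    run_start = np.concatenate(([True], sorted_idx[1:] != sorted_idx[:-1]))
    unique_values = sorted_idx[run_start]
    # Map each original position back to its run, matching tf.unique's second output.
    run_ids = np.cumsum(run_start) - 1
    positions = np.empty(len(indices), dtype=np.int64)
    positions[order] = run_ids
    return unique_values, positions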

@KashiErez
Author

KashiErez commented Jun 2, 2017

Regarding the question: "One useful clarification would be how many indices are going into the UniqueOp, which should be equal to the number of embeddings that are accessed in each iteration (e.g. sentence length)."

Answer:

The batch size is 1024.

Regarding words:
The sentence length is 21.
Number of unique words ~ 100K

But we have another categorical feature that has only one value in each iteration,
and its number of unique values is ~ 300K.

From the trace you can see that this categorical feature's 'unique' op runs much slower than the word embedding's 'unique' op.

So I think that the parameter you should look at first is 'number of unique values' (== lookup table size).

@allenlavoie
Member

In that case, could you add a print node to get the exact shape of the Tensor going into unique?

The whole idea of this code path is that the gradients are sparse; the IndexedSlices from the gradient of the embedding lookup has a number of indices equal to the number of embeddings which were actually accessed (which it sounds like should be ~21504 and ~1024?), independent of the size of the embedding lookup table.
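
For context, a minimal TF 1.x sketch of why the gradient's index count tracks the number of lookups rather than the table size (the table/ids names below are illustrative, not taken from the model in this thread):

import tensorflow as tf

table = tf.Variable(tf.random_normal([100000, 32]))  # 100K x 32 embedding table
ids = tf.placeholder(tf.int64, shape=[None])         # e.g. 1024 * 21 word ids per batch
loss = tf.reduce_sum(tf.nn.embedding_lookup(table, ids))

grad = tf.gradients(loss, table)[0]  # an IndexedSlices, not a dense Tensor
# grad.indices has one entry per id actually looked up, independent of the
# 100K rows in the table; these are the indices that feed the unique op.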

@KashiErez
Author

Hi,

Not sure I understand where to add the print node.
'_apply_sparse_duplicate_indices' gets two parameters: grad, var
Should I print these tensors' shapes?

@KashiErez
Author

KashiErez commented Jun 5, 2017

Hi,

This is the optimizer I used to print the indices shape:
(I copied the code from optimizer.Optimizer class, and injected a tf.Print statement).

import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops


def _my_deduplicate_indexed_slices(values, indices, grad):
    """Sums `values` associated with any non-unique `indices`.
    Args:
      values: A `Tensor` with rank >= 1.
      indices: A one-dimensional integer `Tensor`, indexing into the first
        dimension of `values` (as in an IndexedSlices object).
      grad: The original `IndexedSlices` gradient; used here only so its name
        can be included in the printed message.
    Returns:
      A tuple of (`summed_values`, `unique_indices`) where `unique_indices` is a
      de-duplicated version of `indices` and `summed_values` contains the sum of
      `values` slices associated with each unique index.
    """

    # Print the shape of the indices feeding the unique op (only for the first run).
    indices = tf.Print(indices,
                       data=[grad.name, tf.shape(indices)],
                       message='$$$$$$$$ indices.shape   ----- ',
                       first_n=1)

    unique_indices, new_index_positions = array_ops.unique(indices)
    summed_values = math_ops.unsorted_segment_sum(
        values, new_index_positions,
        array_ops.shape(unique_indices)[0])
    return (summed_values, unique_indices)


class MyAdamOptimizer(tf.train.AdamOptimizer):
    """
    This Optimizer is a workaround for this issue:
    https://github.com/tensorflow/tensorflow/issues/10270
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                 use_locking=False, name="MyAdam"):
        super(MyAdamOptimizer, self).__init__(learning_rate, beta1, beta2, epsilon, use_locking, name)

    def _apply_sparse_duplicate_indices(self, grad, var):
        """
        Overrides the base method only to inject a tf.Print on the gradient
        indices; the 'unique indices' deduplication logic itself is unchanged.
        It is this logic that produces the performance degradation with large
        lookup tables.
        """
        summed_values, unique_indices = _my_deduplicate_indexed_slices(
            values=grad.values, indices=grad.indices, grad=grad)

        gradient_no_duplicate_indices = ops.IndexedSlices(
            indices=unique_indices,
            values=summed_values,
            dense_shape=grad.dense_shape)
        return self._apply_sparse(gradient_no_duplicate_indices, var)

I ran it on 3 models, here are the results:

Model 1 - a few small categorical features (no more than 10K each):

2017-06-05 12:49:30.233421: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_8:0][10266]
2017-06-05 12:49:30.234021: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_4:0][1058]
2017-06-05 12:49:30.233421: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat:0][8687]
2017-06-05 12:49:30.233421: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_6:0][1033]
2017-06-05 12:49:30.237197: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_2:0][1036]

Model 2 - small categorical features (no more than 10K each) + words (50K unique values):

2017-06-05 11:39:45.333000: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_8:0][10266]
2017-06-05 11:39:45.332994: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_4:0][1058]
2017-06-05 11:39:45.332984: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_2:0][1036]
2017-06-05 11:39:45.333465: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat:0][8687]
2017-06-05 11:39:45.333727: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_6:0][1033]
2017-06-05 11:39:45.333756: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [optimizer/gradients/concat_10:0][70681]

Model 3 - small categorical features (no more than 10K each) + words (50K unique values) + big categorical feature (300K unique values):

2017-06-05 12:41:13.841337: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_6:0][1033]
2017-06-05 12:41:13.842412: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_4:0][1057]
2017-06-05 12:41:13.840662: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_8:0][9131]
2017-06-05 12:41:13.841332: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_12:0][40896]
2017-06-05 12:41:13.841334: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat:0][8632]
2017-06-05 12:41:13.841974: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_10:0][315805]
2017-06-05 12:41:13.844600: I tensorflow/core/kernels/logging_ops.cc:79] $$$$$$$$ indices.shape   ----- [my_optimizer_0/gradients/concat_2:0][1036]

Summing up indices sizes:
Model 1 - 22080 (small categorical features)
Model 2 - 92761 (small categorical features + 50k feature)
Model 3 - 377590 (small categorical features + 50k feature + 300K feature)

In summary, feature cardinality and total indices size are correlated.

A small note regarding the gradient scopes (my_optimizer_0/gradients/concat*):

my_optimizer_0 - this is my code.
gradients/concat* - I think it comes from here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/gradients_impl.py

@allenlavoie
Member

Well, that explains why the transfer times are in milliseconds: 3 megabytes of indices transferred at ~1 gigabyte/sec, which then take a round trip (even if they're all duplicates, the second result of unique() has the same size as its input).

There's a separate question of why that many embeddings are accessed at each iteration (all of them?). It's definitely unusual for a language model. Regardless, clearly it would be nice to have a GPU kernel for unique.
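
A rough back-of-the-envelope check of those numbers (assuming int64 indices and the ~1 GB/s effective transfer rate mentioned above):

num_indices = 377590               # model 3, summed over all embedding gradients
bytes_total = num_indices * 8      # int64 indices -> ~3.0 MB
seconds_one_way = bytes_total / 1e9
# ~3 ms per direction, and unique's second output is the same size on the return trip.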

@ningyuwhut

I ran into this problem too; has it been solved?

@allenlavoie
Member

@ningyuwhut I'd suggest using the workaround @KashiErez mentioned for now if you don't care about deduplicating sparse gradients and want the previous behavior. This was a bug fix, so we can't just go back to the old behavior by default.

AFAIK nobody is working on a GPU kernel for UniqueOp, but that still seems like the resolution here if you're interested in taking the bug.

@rmothukuru rmothukuru added the type:performance Performance Issue label May 24, 2021
@mohantym
Contributor

mohantym commented Feb 4, 2022

Hi @KashiErez! Sorry for the late response.
You are using an older version (1.x) of TensorFlow, which is no longer supported. Could you check with the latest versions (TF 2.7/2.8)? A lot of bug fixes have been made and features have been added in the latest versions. Thanks!

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Feb 4, 2022
@mohantym mohantym self-assigned this Feb 4, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Feb 11, 2022
@skyson

skyson commented Feb 11, 2022 via email

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.
