
tf.vectorized_maps resolve fallbacks #55639

Open
bhack opened this issue Apr 15, 2022 · 32 comments
Labels: comp:ops (OPs related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:feature (Feature requests)

Comments

@bhack (Contributor) commented Apr 15, 2022

Mirroring keras-team/keras-cv#291 for the TF core component.

System information

  • TensorFlow version (you are using):
    master
  • Are you willing to contribute it (Yes/No):
    I don't know. If I have a clear contribution path/pointer, probably.

Describe the feature and the current behavior/state.
Missing pfor converters, and eventually mitigating the other fallbacks.

Will this change the current API? How?
No

Who will benefit from this feature?
Anyone hitting the performance cost of these fallbacks.

Any Other info.

/cc @wangpengmit

@bhack (Contributor, Author) commented Apr 18, 2022

On the TF side I'm also adding keras-team/keras-cv#258 (comment), as it will be the next one.

@sachinprasadhs added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Apr 20, 2022
@bhack (Contributor, Author) commented Apr 27, 2022

Any feedback on what to do next?

@bhack (Contributor, Author) commented Apr 28, 2022

I've just updated the fallback list from the current Keras-cv master:

WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting AdjustContrastv2 cause Input "contrast_factor" of op 'AdjustContrastv2' expected to be loop invariant.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting AdjustHue cause Input "delta" of op 'AdjustHue' expected to be loop invariant.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting AdjustSaturation cause Input "scale" of op 'AdjustSaturation' expected to be loop invariant.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting DepthwiseConv2dNative cause Input "filter" of op 'DepthwiseConv2dNative' expected to be not loop invariant.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting DynamicPartition cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting DynamicStitch cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting HistogramFixedWidth cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting RandomShuffle cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting StatelessRandomGetKeyCounter cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting StatelessRandomUniformFullIntV2 cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING  tensorflow:pfor.py:1082 Using a while_loop for converting StridedSlice cause Input "input" of op 'StridedSlice' expected to be not loop invariant.
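
For reference, a minimal way to reproduce one of these warnings outside Keras-cv (a rough sketch, assuming TF 2.x; shapes and factor range are arbitrary): passing a per-image contrast_factor through vectorized_map makes it non-loop-invariant, which is exactly the AdjustContrastv2 fallback logged above.

import tensorflow as tf

images = tf.random.uniform([8, 32, 32, 3])
factors = tf.random.uniform([8], 0.5, 1.5)  # one factor per image

def augment(args):
    image, factor = args
    # contrast_factor varies per pfor iteration, so it cannot be hoisted out
    # of the loop and pfor falls back to a while_loop (the warning above).
    return tf.image.adjust_contrast(image, factor)

out = tf.vectorized_map(augment, (images, factors))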

@wangpengmit (Member) commented

We discussed it in our meeting but still don't have an owner for tf.vectorized_map yet. So the only feedback we can give at the moment is that contributions are welcomed. (CCing @rohan100jain)

@bhack (Contributor, Author) commented Apr 28, 2022

@wangpengmit Thank you, but I need to wait until a resource is allocated to tf.vectorized_map before discussing the next steps in a bit more detail.

I also want to minimize the risk that at some point I find out it is no longer actively supported, or has been superseded by something else, since we don't have public visibility into TF roadmaps.

Please notify me on this ticket when you have a new resource allocated, and I can help.

@wangpengmit (Member) commented

CCing @ishark.

@ishark (Member) commented Aug 31, 2022

Sorry for the late reply on this. Could you please help prioritize which ops you would like to have converters for first?

@bhack (Contributor, Author) commented Aug 31, 2022

I think we need to understand a few things before looking at converter priorities:

I suppose the "no registered converter" cases have to be considered the fail-fast fallback cases for vectorized_map.

As you know, these originate from Keras-cv, so I'm more worried about the "expected to be loop invariant" cases, because I suspect that, just past the missing converters, we will hit the same kind of issue on the other ops as well.

And my guess (it could be wrong) is that we are hitting a hard limit of vectorized_map that is tied, under the hood, to the limits of the underlying native ops/kernels.

E.g. see:
keras-team/keras-cv#291 (comment)
keras-team/keras-cv#581 (comment)

@ishark What do you think?

/cc @LukeWood @qlzh727

@ishark (Member) commented Aug 31, 2022

Thanks for the prompt reply @bhack and for providing the context!

Agreed, it makes sense to understand the whole context before diving into a solution.

The way I understand it, there are two main issues:

  1. Not all ops have vectorization support implemented in vectorized_map. The design of vectorized_map requires that we register each op converter explicitly, so there are quite a few missing ops.
  2. There may be restrictions on which inputs are expected to be invariant. For example, in the AdjustHue op case, delta is expected to be a constant for all of the calls while the first input is vectorized. How the inputs should be treated might depend on each use case; perhaps vectorized_map should provide a flexible mechanism for this.

In my opinion, 1 has a more straightforward solution, unless the op is simply not vectorizable. For 2, on the other hand, things can be a bit tricky and might depend on the use case. One possible approach might be to let users extend the vectorization converters themselves, if a general solution is not suitable. Can you confirm that in your case delta needs to be variant and actually has a different value for each call? Or is that coming from the Keras CV implementation?
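
To illustrate the invariant case in point 2, a minimal sketch (assuming TF 2.x; shapes arbitrary): when delta is the same tensor for every iteration, the registered AdjustHue converter applies and no while_loop fallback is logged.

import tensorflow as tf

images = tf.random.uniform([8, 32, 32, 3])
delta = tf.constant(0.1)  # loop-invariant: the same hue shift for every image

# delta is captured from outside the mapped function, so pfor treats it as
# unstacked and can keep a single batched AdjustHue op.
out = tf.vectorized_map(lambda image: tf.image.adjust_hue(image, delta), images)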

@bhack (Contributor, Author) commented Aug 31, 2022

Obviously 1 is always necessary as a prerequisite, but in this case I think we will end up at 2 by design.

Our variant parameters are mainly inherited from design choices made in the early days of the Keras-cv repo, which adopted within-the-batch randomization of the TF op parameters.

For more context, please check the original thread, in the few messages around keras-team/keras-cv#146 (comment).

See also my last checklist in:
keras-team/keras-cv#581 (comment)

@bhack (Contributor, Author) commented Sep 1, 2022

Just to clarify, with another example, the within-the-batch approach vs. the nature of the TF native op:
The Keras-cv policy is going to ask vectorized_map for a vectorization over a random/variable contrast_factor for every image in the images argument (the batch).
The TF native operator/API, by its nature, currently handles only a constant contrast_factor shared by all the images:

@RegisterPFor("AdjustContrastv2")
def _convert_adjust_contrastv2(pfor_input):
  images = pfor_input.stacked_input(0)
  contrast_factor = pfor_input.unstacked_input(1)
  return wrap(gen_image_ops.adjust_contrastv2(images, contrast_factor), True)

@davidanoel commented

The same is happening for RandAugment. Without RandAugment each epoch takes around 35s on my machine. With RandAugment it takes about 2mins 25 seconds. Any resolution on the roadmap?

@LukeWood (Contributor) commented Sep 30, 2022

The same is happening for RandAugment. Without RandAugment each epoch takes around 35s on my machine. With RandAugment it takes about 2mins 25 seconds. Any resolution on the roadmap

hello!

Please provide repro code. Are you using it in a Sequential model? You should use it in the tf.data pipeline.

@davidanoel commented Sep 30, 2022

@LukeWood I am using it within the tf.data pipeline. See snippets below.

import tensorflow as tf
import tensorflow_datasets as tfds
import keras_cv
from keras_cv.layers import RandAugment

(train_set, valid_set, test_set), info = tfds.load(
    "mnist", split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True, with_info=True)

augmenter = keras_cv.layers.RandAugment(
    value_range=(0, 255),
    augmentations_per_image=3,
    magnitude=0.9,
    magnitude_stddev=0.2,
    rate=0.5,
)

def resize_and_rescale(image, label):
    # convert MNIST images to RGB
    image = tf.image.resize_with_pad(image, RESIZE_TO, RESIZE_TO)
    image = tf.tile(image, [1, 1, 3])
    image = tf.cast(image, tf.float32)
    label = tf.one_hot(label, num_classes)
    return image, label

def prepare_dataset(dataset, shuffle=False, augment=False):
    dataset = dataset.map(resize_and_rescale, num_parallel_calls=AUTO)
    if shuffle:
        dataset = dataset.shuffle(10 * BATCH_SIZE)
    dataset = dataset.batch(BATCH_SIZE)
    if augment:
        dataset = dataset.map(lambda x,y:(augmenter(x),y),num_parallel_calls=AUTO)
    return dataset.prefetch(AUTO)

train_dataset_aug = prepare_dataset(train_set,shuffle=True, augment=True)

@LukeWood (Contributor) commented Oct 1, 2022

@LukeWood I am using it within the tf.data pipeline. See snippets below.


I see. It’s not clear what the expected performance is of RandAugment; this might not be a bug but rather the cost of doing more computation - especially if your neural network is tiny. What’s your model code?

@bhack (Contributor, Author) commented Oct 1, 2022

this might not be a bug but rather the cost of doing more computation - especially if your neural network is tiny. What’s your model code?

Yes, it could still be that a too-small input/network makes the forward and backward pass faster than the augmentation task, so the augmentation becomes the bottleneck. But we have also reported quite heavy overhead with just a single augmentation, RandomContrast, on EfficientNetB0 with CIFAR (admittedly not a huge input size) at keras-team/keras-cv#581.

The main problem is that in Keras-cv we have never benchmarked the performance drop of these vectorized_map fallbacks, and our original choice/policy of randomizing within the batch is one of the root causes of these fallbacks. I still don't have any specific reference to a paper that supports why this training regime gains on loss per epoch. What I can claim for sure is that we have a performance drop that needs to be benchmarked.

Then with @ishark we could check whether there is any action item, or whether we don't have many practical solutions, if we want to keep the within-the-batch augmentation policy, other than changing the tf.image ops to support batched args. But besides being an API break, that would require rewriting kernels and XLA lowerings/bridge (/cc @paynecl, right?).

All this with tf.image API ownership currently orphaned (tensorflow/community#412 (comment)).

@paynecl (Contributor) commented Oct 3, 2022

'But besides being an API break, that would require rewriting kernels and XLA lowerings/bridge.' Possibly. It depends on how the low-level op(s) in question is (are) affected.

@ishark (Member) commented Oct 4, 2022

Thanks @atlanticstarr1 and @bhack for adding details to the bug.

Regarding the code snippet for RandAugment, I did not see vectorized_map being used in the snippet. Tracing through the dataset.map method briefly, it seems like it uses custom dataset ops and not vectorized_map under the hood. @atlanticstarr1, could you please confirm the slowdown is related to vectorized_map? Otherwise we might want to track it as a separate issue.

I will look into a solution for the invariant issue for pfor this quarter and update. Thanks!

@bhack (Contributor, Author) commented Oct 4, 2022

@ishark You can find it if you look at the base class of the image preprocessing layers:

https://github.com/keras-team/keras/blob/master/keras/layers/preprocessing/image_preprocessing.py#L278:L281
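
Roughly, what that base class does is the following (a paraphrased, standalone sketch, not a verbatim copy of the linked Keras source): when auto_vectorize is enabled, the per-image augmentation goes through tf.vectorized_map, which is where the pfor fallback warnings in this issue come from.

import tensorflow as tf

def batch_augment(augment_fn, images, auto_vectorize=True):
    # When auto_vectorize is on, the per-image function runs under pfor
    # (tf.vectorized_map); otherwise it runs in a plain tf.map_fn loop.
    map_fn = tf.vectorized_map if auto_vectorize else tf.map_fn
    return map_fn(augment_fn, images)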

@bhack (Contributor, Author) commented Oct 4, 2022

I will look into a solution for the invariant issue for pfor this quarter and update. Thanks!

With the within-the-batch augmentation policy we have (IMHO wrongly) tried to use vectorized_map to work around the limits of the image ops, which work on an images (batch) input but with a single, fixed transformation argument for the whole batch.

If we are still only talking about covering the missing converters with invariant inputs, I don't think we will solve these cases.

@ishark (Member) commented Oct 4, 2022

Thanks for the link to keras preprocessing layer @bhack.

I meant that I would investigate the original issue of the tf.vectorized_map fallbacks that occur in a few of the Keras CV preprocessing layers and check which ones we can resolve in the short term. It may not solve all the issues related to Keras CV + pfor usage; however, hopefully it will move us further along.

@bhack (Contributor, Author) commented Oct 4, 2022

I meant that I would investigate the original issue of the tf.vectorized_map fallbacks that occur in a few of the Keras CV preprocessing layers and check which ones we can resolve in the short term.

Hopefully we can discuss it publicly as soon as you have an idea. My hypothesis is obviously that in this specific case the lack of converters is just a "facade" problem.

@bhack (Contributor, Author) commented Oct 7, 2022

The main problem is that in Keras-cv we have never benchmarked the performance drop of these vectorized_map fallbacks, and our original choice/policy of randomizing within the batch is one of the root causes of these fallbacks. I still don't have any specific reference to a paper that supports why this training regime gains on loss per epoch. What I can claim for sure is that we have a performance drop that needs to be benchmarked.

Please check my Colab performance test at keras-team/keras-cv#581 (comment)
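
For anyone without access to the Colab, a rough local timing sketch in the same spirit (not the linked benchmark; the shapes, factor range, and iteration count are arbitrary assumptions): it compares a per-image contrast adjustment run through tf.vectorized_map (which falls back as shown above) against an explicit tf.map_fn loop.

import time
import tensorflow as tf

images = tf.random.uniform([128, 224, 224, 3])
factors = tf.random.uniform([128], 0.5, 1.5)  # per-image factor, so pfor falls back

@tf.function
def via_vectorized_map(imgs, f):
    return tf.vectorized_map(lambda a: tf.image.adjust_contrast(a[0], a[1]), (imgs, f))

@tf.function
def via_map_fn(imgs, f):
    return tf.map_fn(lambda a: tf.image.adjust_contrast(a[0], a[1]), (imgs, f),
                     fn_output_signature=tf.float32)

for name, fn in [("vectorized_map", via_vectorized_map), ("map_fn", via_map_fn)]:
    fn(images, factors)  # warm-up / trace once before timing
    start = time.perf_counter()
    for _ in range(10):
        fn(images, factors)
    print(name, time.perf_counter() - start)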

@bhack (Contributor, Author) commented Oct 7, 2022

/cc @martin-gorner

@ishark (Member) commented Oct 7, 2022

Thank you @bhack! This is very helpful.

@martin-gorner commented

some design choices made in the early days of the Keras-cv repo, which adopted within-the-batch randomization

I confirm this is a deliberate product decision: data augmentation techniques that augment an entire batch of images in a similar way are potentially harmful. Many models train better with good stochasticity of the training data. Even inadequate shuffling of data within a batch can show up on learning curves. Data augmentation is supposed to increase the variance of input images. But having an entire batch augmented in the same way works against that.

These effects on training can be more or less visible, sometimes very subtle and therefore very difficult to pinpoint and debug. That's why we want to make sure this is a problem users of KerasCV augmentation layer do not have to worry about.

This discussion is not so much about the numerical impact (of batches augmented in a similar way vs. with good intra-batch augmentation stochasticity) on this or that model. There might be evidence that this does or does not matter for some models. Models vary and the evidence will always be debated.

This product decision is about making this debate a moot point by offering a design that is always safe out of the box.

@bhack (Contributor, Author) commented Nov 10, 2022

Yes, I think we all understand this, but we also know that if in the design phase we don't plan the resources to rewrite the required TF ops and kernels (+XLA), the performance gap is assured, as we need to launch a kernel for every image in the batch.

Also, under some specific known conditions it is hard to plan any refactoring of these specific APIs at all (see tensorflow/community#412 (comment)).

In the early design days, as mentioned again in keras-team/keras-cv#372 (comment), I tried to argue, in case we didn't plan to rewrite these ops/kernels, for a sort of compromise between stochasticity and the time required for a training step:

Can we interpret the mentioned paper as randomizing the augmentation params over sub-batches that repeat the same set of images? In this way we could populate the batch while reducing the number of launched kernels and also reducing the I/O traffic on the filesystem (a minimal sketch follows at the end of this comment).

Maybe you have discussed this internally but in the "community" this has never been addressed.

On the practical side, if this idea does not convince the team, what other solution do we have in hand besides rewriting the ops in TF?
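
For concreteness, a minimal sketch of that sub-batch compromise (my assumption of how it could look, not an agreed design: the batch size must divide evenly by the number of sub-batches, and the existing scalar-factor tf.image op is reused, so there is one kernel launch per sub-batch instead of one per image):

import tensorflow as tf

def subbatch_random_contrast(images, num_subbatches, lower=0.5, upper=1.5):
    # One random factor per sub-batch; every image inside a sub-batch shares it.
    factors = tf.random.uniform([num_subbatches], lower, upper)
    subbatches = tf.split(images, num_subbatches, axis=0)
    augmented = [tf.image.adjust_contrast(sb, factors[i])
                 for i, sb in enumerate(subbatches)]
    return tf.concat(augmented, axis=0)

batch = tf.random.uniform([32, 64, 64, 3])
out = subbatch_random_contrast(batch, num_subbatches=4)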

@martin-gorner commented

I'm hearing you. Performance is an important consideration too. We will be discussing the tradeoff.

@bhack (Contributor, Author) commented Nov 10, 2022

@martin-gorner Thanks, I hope you can keep the discussion open in the community, even if I understand it can sometimes be slower than an internal thread.

Currently I don't have the cloud resources available to compile TF again and prepare some testing PRs.
E.g., just to make a concrete example, even though I haven't looked at the full implementation in detail, judging from the CPU kernel in RandomContrast/AdjustContrast (see keras-team/keras-cv#581 (comment)) I think it would not be too tricky to use a Tensor instead of a scalar to meet the requirements of a within-the-batch RandomContrast augmentation:

const float factor_value = factor();
float* p = output.data();
const float* q = input.data();
for (int64_t n = 0; n < input.size(); ++n) {
  p[n] += factor_value * (q[n] - p[n]);
}
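
Expressed in plain TF ops, the per-image-factor semantics such a change would enable is just the documented AdjustContrast formula, (x - mean) * factor + mean, with a [B]-shaped factor broadcast over the batch (a sketch of the intended semantics, not a proposed kernel):

import tensorflow as tf

def adjust_contrast_per_image(images, factors):
    # images: [B, H, W, C]; factors: [B], one contrast factor per image.
    # Per-image, per-channel mean over the spatial dimensions, as in the kernel.
    means = tf.reduce_mean(images, axis=[1, 2], keepdims=True)
    factors = tf.reshape(factors, [-1, 1, 1, 1])
    return means + factors * (images - means)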

Always assuming that we are not going to accumulate a combinatorial explosion of TF ops to modify, especially with the new 3D augmentation API, and that we have at least enough resources to review PRs.

The missing converters, instead, I think are a minor problem that can be handled more safely by the vectorized_map team without too many issues (or by the community with PRs, if we have enough reviewers).

@bhack (Contributor, Author) commented Nov 12, 2022

@martin-gorner In case it could help: https://arxiv.org/abs/2210.06441

@bhack (Contributor, Author) commented Jan 30, 2023

@martin-gorner What do you think about keras-team/keras-cv#1331 ?

@johnlundberg commented

I'm seeing the same issue with PopulationCount
