
bwd_bypass kernel is empty on AMD HW #20

Closed

mattsinc opened this issue May 1, 2018 · 9 comments

Comments

@mattsinc

mattsinc commented May 1, 2018

I've been looking at the disassembly of the bwd_bypass kernel when run with MIOpen on AMD HW, and I'm noticing that the Bwd kernel for this run is empty:

(literally the entire program)
MIOpenNeuronBwd:
s_endpgm

I've checked the resultant code on multiple generations of AMD HW and multiple versions of MIOpen, and all of them generate similar kernels. So I was wondering: is this intentional? Is the Bwd pass really expected to do nothing?

Thanks,
Matt

@mattsinc
Author

mattsinc commented May 7, 2018

Wanted to ping you all about this again.

Matt

@shidong-ai
Owner

@mattsinc Sorry that I just saw your message. The bypass layer just moves data from one place to another. It was originally used as a template for developing new layers, so it is doing nothing.
Hope this helps.

Shi

@mattsinc
Author

mattsinc commented May 10, 2018

Hi @doody1986,

Thanks for getting back to me. I don't think I quite understand your response, though. I went through the OpenCL source that is used to generate this kernel (MIOpen/src/kernels/MIOpenNeuron.cl). It turns out that every single #if and #elif evaluates to false given how the environment variables are set for bwd_bypass. Given this, only the following lines of code are actually used in the compilation:

    int x = get_global_id(0);
    _FLOAT_ bot_diff_dat[MLO_READ_UNIT];
    _FLOAT_ top_diff_dat[MLO_READ_UNIT];
    _FLOAT_ bot_dat[MLO_READ_UNIT];
    _FLOAT_ top_dat[MLO_READ_UNIT];

    for (int i = 0; i < MLO_READ_UNIT; ++i)
    {
        {
            top_diff_dat[i] = top_diff[dyOffset + x * MLO_READ_UNIT + i];
            bot_dat[i]      = bot_data[xOffset + x * MLO_READ_UNIT + i];
            top_dat[i]      = top_data[yOffset + x * MLO_READ_UNIT + i];
        }
    }
    for (int i = 0; i < MLO_READ_UNIT; ++i)
    {
        {
            bot_diff[dxOffset + x * MLO_READ_UNIT + i] = bot_diff_dat[i];
        }
    }

All of the above code either loads the kernel's input arrays into the local arrays or writes a local array back out to an input argument. However, the local arrays are never used after being set (and bot_diff_dat is never initialized before it's written out), so my guess is that the compiler optimizes out everything in the kernel, which explains why it's empty.
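To illustrate my guess with a simplified stand-alone analogue (hypothetical code, just to show why everything is dead):

    // Simplified C++ analogue of the preprocessed kernel body, assuming
    // MLO_READ_UNIT = 4 (hypothetical, for illustration only).
    void bwd_bypass_reduced(const float* top_diff, const float* bot_data,
                            const float* top_data, float* bot_diff)
    {
        float top_diff_dat[4], bot_dat[4], top_dat[4];
        float bot_diff_dat[4]; // never written by any surviving branch

        for (int i = 0; i < 4; ++i)
        {
            top_diff_dat[i] = top_diff[i]; // dead load: never read again
            bot_dat[i]      = bot_data[i]; // dead load
            top_dat[i]      = top_data[i]; // dead load
        }
        for (int i = 0; i < 4; ++i)
            bot_diff[i] = bot_diff_dat[i]; // stores indeterminate values

        // A compiler may eliminate all of the above, leaving an empty body --
        // which would match the s_endpgm-only kernel I'm seeing.
    }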

But your response indicates that it should be moving data from one place to another -- if the code is being optimized out, that wouldn't happen. Are you expecting the arrays to actually be written out? Or are you truly expecting that the kernel won't do anything?

Thanks,
Matt

@mattsinc
Author

@doody1986, wanted to ping you again about the above.

@shidong-ai
Owner

@mattsinc Thanks for pointing this out. I really appreciate it! I went back to the code and figured it out. For the bypass layer, when MIOpen is enabled I use MIOpen's activation function with a special flag that indicates bypassing. You can refer to this link for more information: https://rocmsoftwareplatform.github.io/MIOpen/doc/html/activation.html#miopenactivationmode-t
The code you posted must correspond to this part.
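Roughly, that path just configures MIOpen's activation descriptor with the pass-through mode; a minimal sketch of the idea (my illustration here, not the exact DNNMark code):

    // Sketch of setting up MIOpen's pass-through activation (illustration
    // only; DNNMark's actual code may differ).
    #include <miopen/miopen.h>

    miopenActivationDescriptor_t activ_desc;
    miopenCreateActivationDescriptor(&activ_desc);
    // miopenActivationPASTHRU means "no activation, pass the data through".
    miopenSetActivationDescriptor(activ_desc, miopenActivationPASTHRU,
                                  /*activAlpha=*/0.0, /*activBeta=*/0.0,
                                  /*activGamma=*/0.0);
    // The backward pass then runs through miopenActivationBackward(), whose
    // PASTHRU kernel is the (empty) MIOpenNeuronBwd you disassembled.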

What I explained before corresponds to the Nvidia implementation rather than the AMD one, so it was not correct; sorry about that. When cuDNN is enabled, the copy actually happens.
I was expecting this layer to bypass the input data without any computation involved. The way AMD does it seems like a bug, but if we think of it as an in-place bypass, it is exactly the right behavior - doing nothing.
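Conceptually, the copy-based bypass on the CUDA side looks something like this (a sketch of the idea with hypothetical names; DNNMark's actual cuDNN-path code may differ):

    #include <cuda_runtime.h>

    // Hypothetical helper: a backward bypass that just copies the top
    // (output) gradient straight into the bottom (input) gradient buffer.
    void bypass_backward(float* d_bot_diff, const float* d_top_diff, size_t n)
    {
        cudaMemcpy(d_bot_diff, d_top_diff, n * sizeof(float),
                   cudaMemcpyDeviceToDevice);
    }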

May I ask why this is bothering you?
Thanks

Shi

@mattsinc
Author

Hi @doody1986,

Thanks for getting back to me! I had noticed this because the profiler was telling me the kernel used 0 vector registers ... which seemed implausible. So I was trying to figure out how that was possible. The above explanation is helpful in that regard.

So is there a next step from here? Is it something wrong on your end? Do I need to open a ticket with the MIOpen folks? Is it just an optimization on the part of the MIOpen folks?

Thanks,
Matt

@shidong-ai
Owner

@mattsinc

Since I am just calling this API from MIOpen, there isn't much I can do in DNNMark, and I am not familiar with their implementation details at the moment.
If you think this part might affect your own work, then opening a ticket might be a good option. This is only my personal suggestion though. :)

Shi

@mattsinc
Author

@doody1986, thanks. It doesn't really affect my work, but it does bother me that there are known problems with unknown solutions. Anyway, I will close this.

Matt

@mattsinc
Author

mattsinc commented Jun 6, 2018

FYI: ROCm/MIOpen#42
