
New and better implementation of activations #6

Merged Jun 19, 2020 · 4 commits

Conversation

PallHaraldsson
Contributor

No description provided.

@PallHaraldsson
Contributor Author

PallHaraldsson commented Jun 17, 2020

[max (for ReLU) compiles to surprisingly short (fast) assembly; I'm a bit puzzled why the equivalent if is longer code, though @btime showed the same speed. Not important, as I'm not changing that.]

But when min and max are used together (as for CELU), I timed my implementation at least 20% faster, and for inputs on one side the original was 583 times slower:

onnx/onnx#2575 (comment)

Maybe I'm missing something, and min and max are better on, say, GPUs for some dataflow reason? Maybe you can time it? I'm not even sure you support GPUs, and even then mine shouldn't be slower. Mish could be optimized; I just have the definition there.
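To make the comparison concrete, here is a hedged sketch of the two styles being discussed (names are mine, not the package's). The min/max form always evaluates exp, while the branch form skips it entirely for x ≥ 0, which would explain the one-sided slowdown:

```julia
# min/max form: exp is computed even when x ≥ 0 and its result is discarded
celu_minmax(x; α=1.0) = max(0, x) + min(0, α * (exp(x / α) - 1))

# branch form: exp is only evaluated on the negative side
celu_branch(x; α=1.0) = x ≥ 0 ? float(x) : α * (exp(x / α) - 1)
```

Both return the same values; only the cost per input regime differs.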

@PallHaraldsson
Contributor Author

Is Mish completely useless without a derivative? It's here on page 2:

https://arxiv.org/pdf/1908.08681v1.pdf
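For what it's worth, writing out mish(x) = x·tanh(softplus(x)) and differentiating by hand (product rule plus chain rule; this is my sketch, not the package's code) gives:

```julia
# Sketch: softplus and its derivative, the logistic sigmoid
sp(x) = log1p(exp(x))          # softplus(x) = log(1 + exp(x))
sigm(x) = 1 / (1 + exp(-x))    # d/dx softplus(x)
mish(x) = x * tanh(sp(x))
# product rule + chain rule, using d/du tanh(u) = sech(u)^2
dmish(x) = tanh(sp(x)) + x * sech(sp(x))^2 * sigm(x)
```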

I just added the most basic plu (not parameterized, though that could easily be added):
https://arxiv.org/pdf/1809.09534.pdf
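A parameterised version could be sketched as follows (this is my reading of the piecewise form in the PLU paper, with its defaults α = 0.1 and c = 1; the package may pick different names or defaults):

```julia
# PLU: identity for |x| ≤ c, slope α outside, clipped continuously
plu(x; α=0.1, c=1.0) = max(α * (x + c) - c, min(α * (x - c) + c, x))
```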

Maybe the credits need to be in the code, or is it ok to have them only here in the PR? Move them to the top comment before you merge?

@PallHaraldsson
Contributor Author

PallHaraldsson commented Jun 17, 2020

You know, I was thinking of ways to speed up these functions: are extreme values common as inputs to activation functions?

They don't even have to be that common. I can approximate with the identity function for high values, which is even exact for values higher than 9.0 (and use 0.0 for all low values):

julia> mish(9f0)-9f0
0.0f0

julia> mish(19.1)-19.1
0.0
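The cutoff idea could be sketched like this (names and thresholds are mine and would need checking per precision; for large x, tanh(softplus(x)) rounds to exactly 1 in Float32, so mish(x) == x there, while the negative cutoff is an approximation, not exact):

```julia
mish(x) = x * tanh(log1p(exp(x)))

# skip the transcendentals outside the "interesting" range
fast_mish(x::Float32) = x ≥ 9f0 ? x : (x ≤ -30f0 ? 0f0 : mish(x))
```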

@sylvaticus
Owner

Thank you!
I think it would be interesting to put the references as comments in the code and parametrise the plu function with α and c (using 0.1 and 1 as defaults). As people can use the defaults, it doesn't add any complexity, while being more flexible for interested users.
It would also be great to have dcelu, dmish and dsoftplus, as it would allow people to use them without having to rely on AD.
Let me know if you prefer doing this yourself, or I can merge this pull request and then do it, as you prefer.
By the way, I want to change the name of the derivatives from dfunction to ∇function, but I'll do this later for all the functions at the same time.
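The two easy derivatives among those could be sketched as follows (names and signatures are guesses matching the current dfunction style, not the package's final API):

```julia
# softplus(x) = log(1 + exp(x))  ⇒  softplus'(x) = 1/(1 + exp(-x)), the sigmoid
dsoftplus(x) = 1 / (1 + exp(-x))

# celu(x; α) = max(0, x) + min(0, α*(exp(x/α) - 1))
# ⇒ derivative is 1 for x ≥ 0 and exp(x/α) for x < 0
dcelu(x; α=1.0) = x ≥ 0 ? 1.0 : exp(x / α)
```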

@PallHaraldsson
Contributor Author

without having to rely on AD

OK, good to know it works as is, with a downside. I mean, I know what AD is, somewhat, just not exactly. I assume there's a runtime cost, and a symbolic derivative is better if you have it.

I tried for fun:
https://www.derivative-calculator.net/

and it doesn't handle softplus(x) and beyond, but it e.g. gave sech(x)^2 for the derivative of tanh, so I timed it (for some inputs there's no change; hopefully it's never slower, or I assume it wouldn't be in the standard library):

julia> x = 10.0; @btime dtanh($x);
  35.581 ns (0 allocations: 0 bytes)

julia> f(x)=sech(x)^2; @btime f($x);
  30.643 ns (0 allocations: 0 bytes)
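Side note (a standard identity, not package code): sech(x)^2 equals 1 − tanh(x)^2, so when tanh(x) has already been computed in the forward pass, the derivative is almost free:

```julia
# 1 - t^2 == sech(x)^2 when t = tanh(x); reuses the forward-pass value
dtanh_from_tanh(t) = 1 - t^2
```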

You didn't answer whether this is used on GPUs. I don't know about tanh and sech there, but neither is ever in hardware that I know of, while tanh is pretty common for ANNs, so who knows for that one? I'd hate to be optimizing for CPUs while making things slower on GPUs.

I'm just busy right now, so more, like dmish, will have to wait.

@PallHaraldsson
Contributor Author

Your call what to do about casing, e.g. Plus, and other upper-casing inconsistencies: softMax, dSoftMax, [..] softplus

I also added spaces myself where I thought it clearer, and changed dSoftMax (including in the visible docs), but then had second thoughts about that, in case you skip spaces intentionally.

@sylvaticus
Owner

Hello, thank you for your contribution.
I am accepting this pull request, but I will then make some changes:

  • in softmax the default parameter should be β=one(x[1]) or β=one.(x) (it doesn't compile as it is now)
  • I prefer optional parameters to remain keyword arguments
  • celu doesn't return the same output for an α parameter different from 1, e.g. celu(2, α=2)
  • plu doesn't return the same output for an α parameter different from 0.1, e.g. plu(4, α=10)
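The β fix from the first bullet could look like this (a sketch only: the keyword default follows the suggestion above, while the max-shift for numerical stability is my own addition, not necessarily what the package will do):

```julia
# β-parametrised softmax with a keyword default of one(x[1]);
# subtracting maximum(x) avoids overflow in exp for large inputs
function softmax(x; β=one(x[1]))
    e = exp.(β .* (x .- maximum(x)))
    return e ./ sum(e)
end
```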
