Learning rate schedulers #56

Merged — 18 commits merged into main from schedulers on Sep 15, 2022
Conversation

@topepo (Member) commented Sep 14, 2022

Closes #12

The schedulers were implemented in R rather than by wrapping the torch scheduler functions. For constant rates, rate_schedule = "none" is used (and is the default).
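For illustration, a minimal sketch of what an R-side schedule can look like (the function name, arguments, and defaults here are hypothetical, not necessarily what this PR adds):

```r
# Hypothetical exponential-decay schedule written as a plain R function.
# It maps the current epoch to a learning rate, so it can be called once
# per epoch from the training loop and the optimizer's rate updated.
schedule_decay_expo <- function(epoch, initial = 0.1, decay = 1) {
  # the rate shrinks multiplicatively as training progresses
  initial * exp(-decay * epoch)
}

schedule_decay_expo(0:3)
# roughly 0.100, 0.037, 0.014, 0.005
```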

@jonthegeek commented

@dfalbel I'd like to help track down what's causing the output differences on Windows (I'm seeing it too and hadn't isolated yet whether it was CUDA or Windows, but this looks like it's Windows).

@topepo We're using {luz} in {tidybert} (relatively big changes are actively in progress, so don't look too closely at it before ~tomorrow), so I'm likely going to re-implement this idea, with slight changes, for the {luz} version. Any caveats to watch for (other than the OS differences)?

R/mlp-fit.R — review comment (resolved)
@dfalbel (Collaborator) commented Sep 15, 2022

PyTorch (and LibTorch) doesn't really ensure strong reproducibility across platforms and hardware.

> Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

See, e.g., https://pytorch.org/docs/stable/notes/randomness.html

We don't yet have an R wrapper for torch.use_deterministic_algorithms(), though, which could help isolate the issue.
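In the meantime, the closest thing available from R is to fix the seeds explicitly; a minimal sketch using only base R's set.seed() and the torch package's torch_manual_seed():

```r
library(torch)

# Fix both the R RNG and torch's RNG before any stochastic work
# (weight initialization, shuffling, dropout, ...).
set.seed(1)
torch_manual_seed(1)

x <- torch_randn(3)

# This makes runs repeatable on a given machine/backend, but, per the
# note above, it does not guarantee identical results across platforms
# or between CPU and GPU.
```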

@topepo (Member, Author) commented Sep 15, 2022

It's really odd where/when the differences occur. The same snapshots pass at first and then show differences later. Since all I have is a Mac, I'm going to restrict the snapshot tests to that OS.

I guess when we add GPU support here I'll have to figure out a way to test and develop with GPU capabilities.
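With testthat, restricting snapshots to one OS can be done by skipping them elsewhere (an illustrative sketch; the fitting call is a placeholder, not the package's actual code):

```r
library(testthat)

test_that("learning rate schedule snapshots (macOS only)", {
  # Snapshot values differ across platforms, so only record and compare
  # them on macOS; the test is skipped everywhere else.
  skip_on_os(c("windows", "linux", "solaris"))

  set.seed(1)
  torch::torch_manual_seed(1)

  # `fit_model()` stands in for whatever call produces the snapshot.
  expect_snapshot(print(fit_model()))
})
```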

@topepo (Member, Author) commented Sep 15, 2022

@jonthegeek As mentioned above, it is very difficult to predict when/where the differences occur. It doesn't seem random but does change over time, even when the code does not.

@jonthegeek commented

> @jonthegeek As mentioned above, it is very difficult to predict when/where the differences occur. It doesn't seem random but does change over time, even when the code does not.

Yeah, we're already running into that, although the cases we have are machine-stable so far. We use torch::torch_manual_seed() fairly liberally, though. I have a note in a test that things seemed more stable when I called it once at the top of the test and then again when getting results. Even with that, I still get different results on my Windows PC + CUDA versus the same machine running torch in a Docker container... BUT it seems to be one set of results for my PC and a second set for non-Windows (I haven't gotten torch + CUDA working in Docker yet to finish isolating it past there, though).
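A sketch of that seeding pattern (the fit and predict calls are placeholders, not {tidybert}'s actual API):

```r
library(testthat)

test_that("model outputs are stable", {
  # Seed once at the top of the test ...
  set.seed(1)
  torch::torch_manual_seed(1)
  model <- fit_model(train_data)                # placeholder fitting call

  # ... and again right before generating the compared results.
  torch::torch_manual_seed(1)
  expect_snapshot(predict(model, test_data))    # placeholder predict call
})
```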

@topepo merged commit fe9ebff into main on Sep 15, 2022
@topepo deleted the schedulers branch on September 15, 2022, 18:32
@github-actions (bot) commented

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions locked and limited the conversation to collaborators on Jan 11, 2023