PyTorch-Lightning Version Update #104

Open · wants to merge 6 commits into main
Conversation

TaleirOfDeynai

What this does:

This PR resolves all deprecation warnings and hard incompatibilities with current versions of PyTorch-Lightning, and sets 1.7.7 as the requested version in environment.yaml.

Switches from the now-defunct TestTubeLogger to Lightning's default TensorBoardLogger. If you need to use another logger, you should be able to set it up in the configs. It works more or less the same as TestTubeLogger did, at least insofar as how Textual Inversion was using it.
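
For reference, here's a minimal sketch (not the PR's exact code) of how the replacement logger can be wired up; the save directory and run name below are made-up placeholders, since the real values come from the configs:

```python
from pytorch_lightning.loggers import TensorBoardLogger

# Hypothetical save_dir/name; the project's configs control the real values.
logger = TensorBoardLogger(save_dir="logs", name="textual_inversion")

# Handed to the Trainer the same way TestTubeLogger used to be:
# trainer = pl.Trainer(logger=logger, ...)
```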

There are a few other minor code adjustments:

  • Shifting things to better locations in Lightning modules.
    • Like how the keyboard-interrupt checkpointing is handled (see the sketch after this list).
    • Or how the loaded dataset info is reported to the console.
  • Resolving some IDE warnings in code paths that are probably unused.
  • Adding a .gitignore to exclude a few things that appear while using TI and probably shouldn't be committed.
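
As a rough illustration of the first point: in Lightning 1.7 the deprecated `on_keyboard_interrupt` hook gives way to `on_exception`, so the interrupt checkpointing can live inside a callback along these lines. The class name and checkpoint directory here are placeholders, not the PR's actual diff:

```python
import os
from pytorch_lightning import Callback


class InterruptCheckpoint(Callback):
    """Save a last-chance checkpoint when training is interrupted."""

    def __init__(self, ckptdir):
        self.ckptdir = ckptdir

    def on_exception(self, trainer, pl_module, exception):
        # `on_exception` receives every exception, so filter for Ctrl+C ourselves
        # and only write the checkpoint from the rank-0 process.
        if isinstance(exception, KeyboardInterrupt) and trainer.global_rank == 0:
            trainer.save_checkpoint(os.path.join(self.ckptdir, "last.ckpt"))
```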

What this does not do:

The reason for all these deprecation warnings is that Lightning has been shifting towards a hardware-agnostic API. To become hardware agnostic ourselves, we'd need a few additional changes here and there; I'm leaving those to someone else to resolve, if it interests them.

I don't know whether this will run on accelerators other than GPU (and maybe CPU) in its current state, but someone else can make those changes too.

Why?

My AMD system is a house of cards when it comes to compute, and it was having difficulty interoperating with Anaconda. Since I had to run on the global installation of Python and its abysmal package management, I needed to bring Textual Inversion up to date so it wasn't fighting with other Stable Diffusion libraries that were keeping up with their dependencies.

There have been some big updates to the ROCm stack lately, so maybe I can use Anaconda now!? ...but those landed after I had already started this journey. Dependency upkeep (especially on your core framework) is a good thing anyway, so here's the PR!

Additionally, after updating to the new strategies and accelerators system, I got a 60% performance boost, because the hard-coded DDP mode was detrimental to my single-GPU setup. PyTorch-Lightning has gotten quite good at auto-detecting the best execution method for compute code, so I left it unopinionated and updated the readme to demonstrate the --accelerator gpu flag, which probably isn't even needed... but since I have not tested accelerators other than GPU, that's the one to show in the demo.

If you really want to force the DDP strategy, you can use --strategy ddp to set it up.
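
For the curious, those flags map onto Trainer arguments roughly like this (an illustrative sketch, not the repo's actual argument plumbing):

```python
from pytorch_lightning import Trainer

# `--accelerator gpu`: let Lightning pick a sensible strategy for the hardware.
trainer = Trainer(accelerator="gpu", devices=1)

# `--accelerator gpu --strategy ddp`: force DDP, the mode that used to be hard-coded.
trainer_ddp = Trainer(accelerator="gpu", devices=1, strategy="ddp")
```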

What needs special review attention:

I am not set up for anything besides Stable Diffusion, and I'm frankly afraid to jostle this fragile setup by trying to test training for other models. I would appreciate it if someone who is set up to test the autoencoder and latent-diffusion configs could please give this branch a try and make sure no deprecation warnings appear for an epoch or two of training.

Also, this moves the printing of `datasets` into the `prepare_data` function.

Lightning does handle calling those hooks itself. The problem was that the `datasets` property does not exist until `setup` is called, so the previous location was the wrong place to print this information.
In one place there was an unqualified module access, and in another, some code was rendered unreachable when someone added a `raise`.

Neither of these code paths is probably relevant, but I want them fixed up anyway.
I have been using this function to get feedback on the relative complexity of the embedding from run to run by recording it in the logger, but something, somewhere, is calling `backward` on it after a little more than 12100 steps.

This makes sure it has a gradient.
`self.logvar` is on the CPU, while the `t` that gets passed in is usually created on `self.device`, which is likely different. PyTorch 1.12 would have allowed this indexing followed by the transfer, but 1.13 is a little stricter. This fixes it by transferring the whole tensor before performing the indexing.
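
A small self-contained illustration of the fix described above; the tensor shapes and names below are stand-ins, not the actual model code:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-ins for the real state: `logvar` lives on the CPU (like `self.logvar`),
# while the timestep indices `t` are created on the model's device.
logvar = torch.zeros(1000)
t = torch.randint(0, 1000, (4,), device=device)

# PyTorch 1.12 tolerated indexing across devices and then transferring:
#     logvar_t = logvar[t].to(device)
# PyTorch 1.13 is stricter about mixed-device indexing, so transfer the whole
# tensor first and index afterwards:
logvar_t = logvar.to(device)[t]
```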