
training = True causes a GPU memory error, asks for 53 GB on normal sized proteins #352

yoann-glanum opened this issue Jan 11, 2023 · 7 comments

@yoann-glanum

yoann-glanum commented Jan 11, 2023

Hello,

Sorry for bothering again,

I've been looking into manipulating the parameters of the batch run function in the Colab (this one), and got a crash.

Trying to isolate the parameters I'd changed, it seems that just activating training = True is enough to trigger it. Here are the parameters, the crash message, and an example input sequence:

---- parameters

all default params except:

  • input and output dirs, but it finds the right files and seems to read them fine (it outputs the right PDB names)
  • stop at score 90 (never mind, I tried the default 100 too)
  • training = True

---- crash message

2023-01-11 09:30:23,449 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_1JEV.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,452 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_5KZT.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,454 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_7JLS.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,465 Found 5 citations for tools or databases
2023-01-11 09:30:27,958 Query 1/3: rcsb_pdb_1JEV (length 517)
2023-01-11 09:30:29,785 Running model_3
2023-01-11 09:30:52,301 Could not predict rcsb_pdb_1JEV. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:30:52,329 Query 2/3: rcsb_pdb_5KZT (length 536)
2023-01-11 09:30:53,880 Running model_3
2023-01-11 09:31:15,591 Could not predict rcsb_pdb_5KZT. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:31:15,622 Query 3/3: rcsb_pdb_7JLS (length 553)
2023-01-11 09:31:16,237 Running model_3
2023-01-11 09:31:32,319 Could not predict rcsb_pdb_7JLS. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:31:32,348 Done

---- example of fasta file

>7JLS_1|Chain A|Probable periplasmic dipeptide-binding lipoprotein DppA|Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) (83332)
MAGLNDIFEAQKIEWHELEVLFQGPMSPDVVLVNGGEPPNPLIPTGTNDSNGGRIIDRLFAGLMSYDAVGKPSLEVAQSIESADNVNYRITVKPGWKFTDGSPVTAHSFVDAWNYGALSTNAQLQQHFFSPIEGFDDVAGAPGDKSRTTMSGLRVVNDLEFTVRLKAPTIDFTLRLGHSSFYPLPDSAFRDMAAFGRNPIGNGPYKLADGPAGPAWEHNVRIDLVPNPDYHGNRKPRNKGLRFEFYANLDTAYADLLSGNLDVLDTIPPSALTVYQRDLGDHATSGPAAINQTLDTPLRLPHFGGEEGRLRRLALSAAINRPQICQQIFAGTRSPARDFTARSLPGFDPNLPGNEVLDYDPQRARRLWAQADAISPWSGRYAIAYNADAGHRDWVDAVANSIKNVLGIDAVAAPQPTFAGFRTQITNRAIDSAFRAGWRGDYPSMIEFLAPLFTAGAGSNDVGYINPEFDAALAAAEAAPTLTESHELVNDAQRILFHDMPVVPLWDYISVVGWSSQVSNVTVTWNGLPDYENIVKAENLYFQGGHHHHHHHH
>7JLS_2|Chain B|Peptide SER-VAL-ALA|Escherichia coli BL21(DE3) (469008)
SVA

There are two chains, but it warns that it only takes the first one (batch for multimer is another thing I have to look into), so it should run fine as a monomer, especially with proteins of length 500 to 550.

The first thing I could think of that might cause this is the training = True flag triggering gradient computation for backprop, but to my understanding it's only supposed to activate the dropout layers?
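
To illustrate what I mean, here's a minimal toy sketch (generic Haiku/JAX, not ColabFold's actual code) of how I assumed an is_training flag gates dropout: it's just an extra random mask applied during the forward pass, and no gradients exist unless something explicitly calls jax.grad.

```python
import jax
import jax.numpy as jnp
import haiku as hk

def forward(x, is_training: bool):
    x = hk.Linear(64)(x)
    if is_training:
        # dropout is just an elementwise random mask on the forward pass;
        # nothing here builds or stores gradients
        x = hk.dropout(hk.next_rng_key(), 0.1, x)
    return hk.Linear(1)(x)

model = hk.transform(forward)
rng = jax.random.PRNGKey(0)
x = jnp.ones((4, 32))
params = model.init(rng, x, is_training=True)
out = model.apply(params, rng, x, is_training=True)  # forward pass only, no jax.grad anywhere
```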

@sokrypton
Owner

sokrypton commented Jan 11, 2023 via email

@yoann-glanum
Author

Oh, so that's the intended / normal way of functioning, then?

I was just raising the issue because I thought it might be a recent bug or something similar ^^

@sokrypton
Owner

sokrypton commented Jan 11, 2023 via email

@yoann-glanum
Author

I see I see,

It seems wild to me that the low_memory flag is the difference between the 6-7 GB of GPU RAM the default run uses on these sequences and 53 GB; that's a beefy optimisation.

I was assuming the is_training flag would only affect the layers used during inference, and that gradient computation would be a fully separate step that wouldn't be activated.

Unless dropout needs a very particular kind of randomness for its activation, I don't see these layers requiring any more compute than an average probability-distribution sampling?

(I don't particularly need that option for what I'm doing, I'm just trying to test everything, but it's good to know there'd be a way to make it work.)
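
For my own intuition, here's a generic toy sketch (not AlphaFold's actual implementation, just my assumption about what a "low memory" path does) of why chunking can make such a difference: the full attention logits for a long sequence get materialised all at once versus one query block at a time, so peak memory drops roughly by L / chunk while the result is identical.

```python
import jax
import jax.numpy as jnp

def attention_full(q, k, v):
    # materialises the full (heads, L, L) logits tensor in one go
    logits = jnp.einsum('hqd,hkd->hqk', q, k)
    weights = jax.nn.softmax(logits, axis=-1)
    return jnp.einsum('hqk,hkd->hqd', weights, v)

def attention_chunked(q, k, v, chunk=64):
    # same result, but only a (heads, chunk, L) logits block exists at any time
    outs = []
    for start in range(0, q.shape[1], chunk):
        q_blk = q[:, start:start + chunk]
        logits = jnp.einsum('hqd,hkd->hqk', q_blk, k)
        weights = jax.nn.softmax(logits, axis=-1)
        outs.append(jnp.einsum('hqk,hkd->hqd', weights, v))
    return jnp.concatenate(outs, axis=1)

h, L, d = 4, 550, 32
q = k = v = jnp.ones((h, L, d))
assert jnp.allclose(attention_full(q, k, v), attention_chunked(q, k, v), atol=1e-5)
```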

@sokrypton
Owner

AlphaFold v2.3.1 fixes this issue by using global_config to enable dropout instead of is_training (which also disables the low-memory path).
See the commit here:
google-deepmind/alphafold@f96e254

In the colabfold beta branch, I've switched to this setup to enable dropouts. Try it out:
https://colab.research.google.com/github/sokrypton/ColabFold/blob/beta/AlphaFold2.ipynb
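
Roughly, on the plain AlphaFold side that change looks like this (a sketch only; the attribute names are my recollection of the linked commit, so double-check against it, and the params path is a placeholder):

```python
from alphafold.model import config, data, model

model_name = "model_3"
cfg = config.model_config(model_name)

# v2.3.1-style: keep is_training=False and switch dropout on through the
# global config, so the subbatched (low-memory) inference path stays active
cfg.model.global_config.eval_dropout = True

params = data.get_model_haiku_params(model_name, data_dir="/path/to/params")  # placeholder path
runner = model.RunModel(cfg, params)
# prediction = runner.predict(processed_feature_dict, random_seed=0)
```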

@yoann-glanum
Author

(Sorry I had to be off work for a few days)

Tested it on the v3 notebook; it does seem to run fine now, which is great for more in-depth exploration of conformations (when runtimes allow it 😬).

I've also tried out some of the other new options (the automatic 20 recycles with 0.5 recycle tolerance, 2 seeds, save_all) and had a few thoughts:

  • does the recycle tolerance have the same interpretation as the inter-model stop_at_score? i.e. a tolerance of 0.9 means we stop recycling when we reach [metric] = 0.9? (Intuitively a tolerance would go the other way, with tol = 0 <=> stop_at = 100.)
  • the recycle tolerance seems to go straight into the model-loading functions, unlike rank_by or stop_at_score, which live only in batch.py and seem quite simple to manipulate; is it possible to choose which metric (plddt, ptm, iptm) the tolerance looks at?
  • other than not being sure what the tolerance was looking at, I do get different models ending at different recycle counts
  • seeds seem to work as well
  • save_all works, at least in the local Colab file folder rather than the zip or elsewhere (though that fits with the script implementation, so I assume it's intended)

Boy, do I have a few things to catch up on in my colab / fork

@sokrypton
Owner

Thanks for the report! The save_all option should be saving to the output zip, I'll fix this!
The tolerance metric was introduced by DeepMind in v2.3.1, so we don't really want to mess with it at the moment. What it does is monitor the RMSD between recycles and terminate as soon as the model converges. stop_at_score also terminates early, but it does so when any of the models hits a specific score. Maybe this should be an option...
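
To spell out the two stopping rules (a toy sketch, not ColabFold's actual loop; predict_once and the result keys are made up for illustration):

```python
import numpy as np

def run_recycles(predict_once, max_recycles=20, tolerance=0.5):
    # "tolerance": stop recycling once the structure stops moving between recycles
    prev_ca, result, n = None, None, 0
    for n in range(max_recycles + 1):
        result = predict_once(prev=prev_ca)       # hypothetical single-recycle call
        ca = result["ca_positions"]               # (L, 3) C-alpha coordinates
        if prev_ca is not None:
            rmsd = np.sqrt(np.mean(np.sum((ca - prev_ca) ** 2, axis=-1)))
            if rmsd < tolerance:                  # converged, stop early
                break
        prev_ca = ca
    return result, n

def run_models(model_runners, stop_at_score=90.0):
    # "stop_at_score": stop running further models once any model is good enough
    results = []
    for name, predict_once in model_runners.items():
        result, n_recycles = run_recycles(predict_once)
        results.append((name, result))
        if result["mean_plddt"] >= stop_at_score:
            break
    return results
```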
