
training = True causes a GPU memory error, asks for 53 GB on normal sized proteins #352

yoann-glanum opened this issue Jan 11, 2023 · 7 comments

@yoann-glanum

yoann-glanum commented Jan 11, 2023

Hello,

Sorry for bothering again,

I've been looking into manipulating the parameters of the batch run function in the Colab (this one), and got a crash.

Trying to isolate the parameters I'd changed, it seems that just activating training = True is enough to trigger it. Here are the parameters, the crash message, and an example input sequence:

---- parameters

all default params except:

  • input and output dirs, but it finds the right files and seems to read them fine (it outputs the right PDB names)
  • stop at score 90 (never mind, I tried the default 100 too)
  • training = True

---- crash message

2023-01-11 09:30:23,449 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_1JEV.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,452 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_5KZT.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,454 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_7JLS.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,465 Found 5 citations for tools or databases
2023-01-11 09:30:27,958 Query 1/3: rcsb_pdb_1JEV (length 517)
2023-01-11 09:30:29,785 Running model_3
2023-01-11 09:30:52,301 Could not predict rcsb_pdb_1JEV. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:30:52,329 Query 2/3: rcsb_pdb_5KZT (length 536)
2023-01-11 09:30:53,880 Running model_3
2023-01-11 09:31:15,591 Could not predict rcsb_pdb_5KZT. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:31:15,622 Query 3/3: rcsb_pdb_7JLS (length 553)
2023-01-11 09:31:16,237 Running model_3
2023-01-11 09:31:32,319 Could not predict rcsb_pdb_7JLS. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:31:32,348 Done

---- example of fasta file

>7JLS_1|Chain A|Probable periplasmic dipeptide-binding lipoprotein DppA|Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) (83332)
MAGLNDIFEAQKIEWHELEVLFQGPMSPDVVLVNGGEPPNPLIPTGTNDSNGGRIIDRLFAGLMSYDAVGKPSLEVAQSIESADNVNYRITVKPGWKFTDGSPVTAHSFVDAWNYGALSTNAQLQQHFFSPIEGFDDVAGAPGDKSRTTMSGLRVVNDLEFTVRLKAPTIDFTLRLGHSSFYPLPDSAFRDMAAFGRNPIGNGPYKLADGPAGPAWEHNVRIDLVPNPDYHGNRKPRNKGLRFEFYANLDTAYADLLSGNLDVLDTIPPSALTVYQRDLGDHATSGPAAINQTLDTPLRLPHFGGEEGRLRRLALSAAINRPQICQQIFAGTRSPARDFTARSLPGFDPNLPGNEVLDYDPQRARRLWAQADAISPWSGRYAIAYNADAGHRDWVDAVANSIKNVLGIDAVAAPQPTFAGFRTQITNRAIDSAFRAGWRGDYPSMIEFLAPLFTAGAGSNDVGYINPEFDAALAAAEAAPTLTESHELVNDAQRILFHDMPVVPLWDYISVVGWSSQVSNVTVTWNGLPDYENIVKAENLYFQGGHHHHHHHH
>7JLS_2|Chain B|Peptide SER-VAL-ALA|Escherichia coli BL21(DE3) (469008)
SVA

There are two chains, but it warns that it only takes the first one (batch for multimer is another thing I have to look into), so it should run fine as a monomer, especially with proteins of length 500 to 550.

The first thing I could think of that might cause this is the training = True flag triggering gradient computation for backprop, but to my understanding it's only supposed to activate the dropout layers?
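
To illustrate what I mean, here's a minimal toy sketch (generic Haiku/JAX, not ColabFold's actual code) of how I assumed an is_training flag gates dropout: it's just an extra random mask applied during the forward pass, and no gradients exist unless something explicitly calls jax.grad.

```python
import jax
import jax.numpy as jnp
import haiku as hk

def forward(x, is_training: bool):
    x = hk.Linear(64)(x)
    if is_training:
        # dropout is just an elementwise random mask on the forward pass;
        # nothing here builds or stores gradients
        x = hk.dropout(hk.next_rng_key(), 0.1, x)
    return hk.Linear(1)(x)

model = hk.transform(forward)
rng = jax.random.PRNGKey(0)
x = jnp.ones((4, 32))
params = model.init(rng, x, is_training=True)
out = model.apply(params, rng, x, is_training=True)  # forward pass only, no jax.grad anywhere
```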

@sokrypton
Owner

sokrypton commented Jan 11, 2023 via email

@yoann-glanum
Author

Oh, so that's the intended / normal way of functioning, then?

I was just raising the issue because I thought it might be a recent bug or something similar ^^

@sokrypton
Owner

sokrypton commented Jan 11, 2023 via email

@yoann-glanum
Author

I see I see,

It seems wild to me that the low_memory flag is the difference between the 6-7 GB of GPU RAM the default run uses on these sequences and 53 GB; that's a beefy optimisation.

I was assuming the is_training flag would only affect the layers used during inference, and that gradient computation would be a fully separate step that wouldn't be activated.

Unless dropout needs a very particular kind of randomness for its activation, I don't see these layers requiring any more compute than an average probability-distribution sampling?

(I don't particularly need that option for what I'm doing, I'm just trying to test everything, but it's good to know there'd be a way to make it work.)
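
For my own intuition, here's a generic toy sketch (not AlphaFold's actual implementation, just my assumption about what a "low memory" path does) of why chunking can make such a difference: the full attention logits for a long sequence get materialised all at once versus one query block at a time, so peak memory drops roughly by L / chunk while the result is identical.

```python
import jax
import jax.numpy as jnp

def attention_full(q, k, v):
    # materialises the full (heads, L, L) logits tensor in one go
    logits = jnp.einsum('hqd,hkd->hqk', q, k)
    weights = jax.nn.softmax(logits, axis=-1)
    return jnp.einsum('hqk,hkd->hqd', weights, v)

def attention_chunked(q, k, v, chunk=64):
    # same result, but only a (heads, chunk, L) logits block exists at any time
    outs = []
    for start in range(0, q.shape[1], chunk):
        q_blk = q[:, start:start + chunk]
        logits = jnp.einsum('hqd,hkd->hqk', q_blk, k)
        weights = jax.nn.softmax(logits, axis=-1)
        outs.append(jnp.einsum('hqk,hkd->hqd', weights, v))
    return jnp.concatenate(outs, axis=1)

h, L, d = 4, 550, 32
q = k = v = jnp.ones((h, L, d))
assert jnp.allclose(attention_full(q, k, v), attention_chunked(q, k, v), atol=1e-5)
```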

@sokrypton
Owner

AlphaFold v2.3.1 fixes this issue by using global_config to enable dropout instead of is_training (which also disables the low-memory path).
See the commit here:
google-deepmind/alphafold@f96e254

In the colabfold beta branch, I've switched to this setup to enable dropouts. Try it out:
https://colab.research.google.com/github/sokrypton/ColabFold/blob/beta/AlphaFold2.ipynb
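
Roughly, on the plain AlphaFold side that change looks like this (a sketch only; the attribute names are my recollection of the linked commit, so double-check against it, and the params path is a placeholder):

```python
from alphafold.model import config, data, model

model_name = "model_3"
cfg = config.model_config(model_name)

# v2.3.1-style: keep is_training=False and switch dropout on through the
# global config, so the subbatched (low-memory) inference path stays active
cfg.model.global_config.eval_dropout = True

params = data.get_model_haiku_params(model_name, data_dir="/path/to/params")  # placeholder path
runner = model.RunModel(cfg, params)
# prediction = runner.predict(processed_feature_dict, random_seed=0)
```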

@yoann-glanum
Author

(Sorry I had to be off work for a few days)

Tested it on the v3 notebook; it does seem to run fine now, which is great for more in-depth exploration of conformations (when runtimes allow it 😬).

I've also tried out some of the other new options (the automatic 20 recycles with 0.5 recycle tolerance, 2 seeds, save_all) and had a few thoughts:

  • does the recycle tolerance have the same interpretation as the inter-model stop_at_score? i.e. a tolerance of 0.9 means we stop recycling when we reach [metric] = 0.9? (Intuitively a tolerance would go the other way, with tol = 0 <=> stop_at = 100.)
  • the recycle tolerance seems to go straight into the model-loading functions, unlike rank_by or stop_at_score, which live only in batch.py and seem quite simple to manipulate; is it possible to choose which metric (plddt, ptm, iptm) the tolerance looks at?
  • other than not being sure what the tolerance was looking at, I do get different models ending at different recycle counts
  • seeds seem to work as well
  • save_all works, at least in the local Colab file folder rather than the zip or elsewhere (though that fits with the script implementation, so I assume it's intended)

Boy, do I have a few things to catch up on in my colab / fork

@sokrypton
Owner

Thanks for the report! The save_all option should be saving to the output zip, I'll fix this!
The tolerance metric was introduced by DeepMind in v2.3.1, so we don't really want to mess with it at the moment. What it does is monitor the RMSD between recycles and terminate as soon as the model converges. stop_at_score also terminates early, but it does so when any of the models hits a specific score. Maybe this should be an option...
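
To spell out the two stopping rules (a toy sketch, not ColabFold's actual loop; predict_once and the result keys are made up for illustration):

```python
import numpy as np

def run_recycles(predict_once, max_recycles=20, tolerance=0.5):
    # "tolerance": stop recycling once the structure stops moving between recycles
    prev_ca, result, n = None, None, 0
    for n in range(max_recycles + 1):
        result = predict_once(prev=prev_ca)       # hypothetical single-recycle call
        ca = result["ca_positions"]               # (L, 3) C-alpha coordinates
        if prev_ca is not None:
            rmsd = np.sqrt(np.mean(np.sum((ca - prev_ca) ** 2, axis=-1)))
            if rmsd < tolerance:                  # converged, stop early
                break
        prev_ca = ca
    return result, n

def run_models(model_runners, stop_at_score=90.0):
    # "stop_at_score": stop running further models once any model is good enough
    results = []
    for name, predict_once in model_runners.items():
        result, n_recycles = run_recycles(predict_once)
        results.append((name, result))
        if result["mean_plddt"] >= stop_at_score:
            break
    return results
```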
