training = True causes a GPU memory error, asks for 53 GB on normal sized proteins #352
Comments
By default, is_training=True also disables the memory-saving components of
the model.
I guess one could go through and hard-code these to True across the code, or
introduce a low_memory option into the system.
We just need to confirm the low_memory option isn't interfering with how
dropouts are distributed...
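For readers following along, here is a minimal numpy sketch (illustrative only, not ColabFold's actual implementation) of the kind of memory saving being discussed: processing attention queries in sub-batches keeps only a (chunk, N) slice of the logits matrix alive at a time, while producing the same result as the full computation.

```python
# Illustrative sketch (not ColabFold's real code) of why sub-batching
# saves memory: the full attention path materialises an (N, N) logits
# matrix, while the chunked path only ever holds a (chunk, N) slice.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_full(q, k, v):
    # Peak memory: the full (N, N) logits matrix.
    return softmax(q @ k.T) @ v

def attention_chunked(q, k, v, chunk=64):
    # Same output; peak memory is only a (chunk, N) logits slice.
    out = np.empty_like(v)
    for s in range(0, q.shape[0], chunk):
        out[s:s + chunk] = softmax(q[s:s + chunk] @ k.T) @ v
    return out
```

Sub-batching trades a little speed for a much smaller peak allocation, which is why turning it off can inflate a run from a few GB to tens of GB on long sequences.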
On Wed, Jan 11, 2023, 4:44 AM yoann-glanum wrote:
Hello,
Sorry for bothering again,
I've been looking into manipulating the parameters of the batch run
function in the colab, this one
<https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/batch/AlphaFold2_batch.ipynb>,
and got a crash.
Trying to isolate the parameters I've changed, it seems that just activating
training = True is enough. Here are the parameters, crash message, and input
sequence:
---- parameters
all default params except :
- input and output dirs, but it finds the right files and seems to
read them fine (outputs the right pdb names)
- stop at score 90
- training = True
---- crash message
2023-01-11 09:30:23,449 More than one sequence in
/content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_1JEV.fasta,
ignoring all but the first sequence
2023-01-11 09:30:23,452 More than one sequence in
/content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_5KZT.fasta,
ignoring all but the first sequence
2023-01-11 09:30:23,454 More than one sequence in
/content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_7JLS.fasta,
ignoring all but the first sequence
2023-01-11 09:30:23,465 Found 5 citations for tools or databases
2023-01-11 09:30:27,958 Query 1/3: rcsb_pdb_1JEV (length 517)
2023-01-11 09:30:29,785 Running model_3
2023-01-11 09:30:52,301 Could not predict rcsb_pdb_1JEV. Not Enough GPU
memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate
53061779456 bytes.
2023-01-11 09:30:52,329 Query 2/3: rcsb_pdb_5KZT (length 536)
2023-01-11 09:30:53,880 Running model_3
2023-01-11 09:31:15,591 Could not predict rcsb_pdb_5KZT. Not Enough GPU
memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate
53061779456 bytes.
2023-01-11 09:31:15,622 Query 3/3: rcsb_pdb_7JLS (length 553)
2023-01-11 09:31:16,237 Running model_3
2023-01-11 09:31:32,319 Could not predict rcsb_pdb_7JLS. Not Enough GPU
memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate
53061779456 bytes.
2023-01-11 09:31:32,348 Done
---- example of fasta file
>7JLS_1|Chain A|Probable periplasmic dipeptide-binding lipoprotein DppA|Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) (83332)
MAGLNDIFEAQKIEWHELEVLFQGPMSPDVVLVNGGEPPNPLIPTGTNDSNGGRIIDRLFAGLMSYDAVGKPSLEVAQSIESADNVNYRITVKPGWKFTDGSPVTAHSFVDAWNYGALSTNAQLQQHFFSPIEGFDDVAGAPGDKSRTTMSGLRVVNDLEFTVRLKAPTIDFTLRLGHSSFYPLPDSAFRDMAAFGRNPIGNGPYKLADGPAGPAWEHNVRIDLVPNPDYHGNRKPRNKGLRFEFYANLDTAYADLLSGNLDVLDTIPPSALTVYQRDLGDHATSGPAAINQTLDTPLRLPHFGGEEGRLRRLALSAAINRPQICQQIFAGTRSPARDFTARSLPGFDPNLPGNEVLDYDPQRARRLWAQADAISPWSGRYAIAYNADAGHRDWVDAVANSIKNVLGIDAVAAPQPTFAGFRTQITNRAIDSAFRAGWRGDYPSMIEFLAPLFTAGAGSNDVGYINPEFDAALAAAEAAPTLTESHELVNDAQRILFHDMPVVPLWDYISVVGWSSQVSNVTVTWNGLPDYENIVKAENLYFQGGHHHHHHHH
>7JLS_2|Chain B|Peptide SER-VAL-ALA|Escherichia coli BL21(DE3) (469008)
SVA
------------------------------
There are two chains, but it warns that it only takes a single one (batch
for multimer is another thing I have to look into), so it should run fine as
a monomer, especially with proteins of size 500 to 550.
The first thing I could think of that might cause this is the flag training
= True triggering computation of all the gradients for backprop, but by my
understanding it's only supposed to activate the dropout layers?
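The questioner's intuition here is essentially right: inference-time dropout by itself is cheap. A stdlib-only sketch (a hypothetical helper, not ColabFold code) of what dropout does at inference: one RNG draw and one multiply per element, with no gradient bookkeeping at all.

```python
# Hypothetical stdlib-only illustration (not ColabFold code): dropout at
# inference is just elementwise masking with rescaling; no gradients are
# computed and no extra tensors are kept around.
import random

def dropout(values, rate, rng):
    # Zero each unit with probability `rate`; rescale survivors by
    # 1/(1-rate) so the expected activation is unchanged.
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

The memory blow-up in the log above comes not from dropout itself but from the memory-saving paths that the is_training flag disables as a side effect.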
Oh, so that's the intended / normal way of functioning then? I was just raising the issue because I thought it might be a recent bug or something similar ^^
Yes, this is intended. But I'm not sure if the low_memory flag (which is what is_training disables) is only needed for proper gradient computation (which isn't needed here) or also for proper dropout distribution. If it's not needed for the latter, we can set it to True!
I see, I see. It seems wild to me that the low_memory flag is the difference between 6-7 GB of GPU RAM in the default run on these sequences and 53 GB; that's a hefty optimisation.

I was assuming the is_training flag would only affect the layers during inference, and that gradient computation would be a fully separate step that never gets activated. Unless dropout needs a very particular kind of randomness for its activation, I don't see these layers requiring any more computational effort than an average probability-distribution sampling?

(I don't particularly need that option for what I'm doing, I'm just trying to test everything, but it's good to know there'd be a way to make it work.)
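For scale, a quick back-of-envelope on the allocation reported in the crash log (the 16 GB figure below assumes a free-tier Colab T4 GPU; that's an assumption about the runtime, not something stated in the thread):

```python
# Back-of-envelope on the failed allocation from the crash log above.
requested = 53_061_779_456          # bytes, from the RESOURCE_EXHAUSTED message
gib = requested / 2**30             # convert to binary gigabytes
print(f"requested: {gib:.1f} GiB")  # requested: 49.4 GiB
shortfall = gib / 16                # vs. an assumed 16 GB card (Colab T4)
print(f"~{shortfall:.1f}x more than the card holds")
```

So the single failed allocation alone is roughly three times the whole card, which is consistent with the run fitting comfortably when the low-memory path stays on.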
alphafold v2.3.1 fixes this issue by using global_config to enable dropouts instead of is_training (which also disables low_memory). In the colabfold beta branch, I've switched to this setup to enable dropouts. Try it out:
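A hypothetical sketch of that decoupling (GlobalConfig, eval_dropout and use_subbatching are invented names here, not AlphaFold's real config fields): dropout is driven by a config flag while the memory-saving path stays enabled.

```python
from dataclasses import dataclass

# Invented names for illustration; this is not AlphaFold's actual config.
# The point of the fix: dropout and the low-memory path become independent
# switches instead of both hanging off is_training.
@dataclass
class GlobalConfig:
    eval_dropout: bool = False    # stochastic predictions at inference
    use_subbatching: bool = True  # memory-saving path stays on

def effective_dropout_rate(base_rate, cfg):
    # Previously this was gated on is_training, which also disabled
    # sub-batching; here the two concerns no longer interact.
    return base_rate if cfg.eval_dropout else 0.0
```

With this shape, enabling dropout for conformational sampling no longer forces the 53 GB full-memory code path.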
(Sorry, I had to be off work for a few days.) Tested it on the v3, and it does seem to run fine now; that's great for more in-depth exploration of conformations (when runtimes allow it 😬). I've also tried out some of the other new options (the automatic 20 recycles with 0.5 recycle tolerance, 2 seeds, save_all) and had a few thoughts:
Boy, do I have a few things to catch up on in my colab / fork
Thanks for the report! The save_all option should be saving to the output zip, I'll fix this!