How to run the notebooks over a .fasta file? #5

Closed
xinformatics opened this issue Jul 24, 2021 · 9 comments

Comments

@xinformatics

Could you please tell me how to run the notebooks over a FASTA file? I wish to loop through the FASTA file and generate .pdb files.

@sokrypton
Owner

Unfortunately, Google Colab is not designed for production runs. It is intended to provide an interactive session. If we provide capabilities to iterate through many proteins (with minimal "interactive" input from the user), the user will be heavily penalized (lose good-GPU priority) for any future Google Colab runs.

That being said, we could provide non-google-colab/non-notebook examples for production runs.

@xinformatics
Author

Thank you so much. I use the Pro version of Colab. Do you think the same issue would still be a problem for Pro users? Also, please provide the non-google-colab/non-notebook examples. I have a FASTA file with 964 sequences, and my task is to get model representations for all of them.

@universvm

universvm commented Jul 26, 2021

We built a FASTA parser on top of this project, which you can check out here:

https://github.com/wells-wood-research/alphafold2-multiprocessing

The idea is that you provide a FASTA file with multiple sequences and the code runs each of them through AlphaFold.

We've also added multiprocessing to run multiple structures at once. This is intended to be run with a local copy of AlphaFold, but I'm sure you could adapt it to run on Colab.
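
For readers who only need the looping step, here is a minimal sketch of reading a multi-record FASTA file into (header, sequence) pairs. It is not taken from the linked repository; the function name `read_fasta` is hypothetical, and the prediction call that would consume each record is left out on purpose.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file.

    Hypothetical helper for illustration; sequences that span
    several lines are joined into one string.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)


# Usage with an in-memory example; a real run would pass open("my.fasta").
example = io.StringIO(">seq1\nMKT\nAYI\n>seq2\nGSH\n")
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

Each yielded pair could then be handed to whatever local prediction routine is in use, one sequence at a time.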

@milot-mirdita
Collaborator

I would ask you to please not use automation to submit jobs to the MMseqs2 API currently. Right now we don't implement any prioritization, so you will block the queue for everyone.

We could implement some prioritization scheme; the API should be fast enough to deal with a few thousand automated jobs. However, right now automated submissions would result in a bad user experience for Colab notebook users.

@milot-mirdita
Collaborator

The jobsystem is implemented here:
https://github.com/soedinglab/MMseqs2-App/blob/master/backend/jobsystem.go

We will also release the script to run MMseqs2 locally soon (we are still improving MSA quality).
I would also prefer that you run MMseqs2 yourself if you are running things in an automated fashion.

@milot-mirdita
Collaborator

I had to add rate limiting to the MSA submission endpoint.

If you want a couple hundred MSAs please submit only one SINGLE job with multiple queries as one single FASTA file:

>1
M...
>2
M...
>3
G...
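
A minimal sketch of assembling such a single multi-query FASTA from a list of (name, sequence) pairs. The helper name `build_batch_fasta` is hypothetical, and renumbering the headers 1..N simply mirrors the example above; whether the API requires numeric headers is not stated in this thread. The sketch only builds the text and does not submit anything.

```python
from typing import Dict, List, Tuple


def build_batch_fasta(records: List[Tuple[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Return (fasta_text, id_to_name) with headers renumbered 1..N.

    Hypothetical helper: the numeric ids follow the example in this
    thread, and the mapping lets results be matched to original names.
    """
    lines = []
    id_to_name = {}
    for index, (name, sequence) in enumerate(records, start=1):
        query_id = str(index)
        id_to_name[query_id] = name
        lines.append(f">{query_id}")
        lines.append(sequence)
    return "\n".join(lines) + "\n", id_to_name


# Usage: two toy sequences become one FASTA text with headers >1 and >2.
fasta_text, id_to_name = build_batch_fasta([("protA", "MKT"), ("protB", "GSH")])
print(fasta_text, end="")
```

Keeping the id-to-name mapping matters because the results come back in random order, keyed only by the header.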

You'll eventually get two a3m files (UniRef and environmental), each containing multiple MSAs separated by null bytes. However, the order of the MSAs is random (due to threading), so you'll have to look at the first line of each entry to tell which query it belongs to.
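
A minimal sketch of splitting such a file back into per-query MSAs. It assumes the file has already been read into a string and that the first line of each entry is a FASTA-style header such as `>1`, whose first token is the query id; both are assumptions, and the helper name `split_a3m` is hypothetical.

```python
from typing import Dict


def split_a3m(blob: str) -> Dict[str, str]:
    """Map query id -> MSA text for a null-byte-separated a3m blob.

    Assumption: each entry starts with a header line like ">1", and
    the first whitespace-delimited token of that line is the query id.
    """
    msas = {}
    for entry in blob.split("\x00"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after a trailing null byte
        tokens = entry.splitlines()[0].lstrip(">").split()
        if not tokens:
            continue  # header without an id; nothing to key on
        msas[tokens[0]] = entry + "\n"
    return msas


# Usage: entries arrive in arbitrary order but are keyed by their own header.
blob = ">2\nGSH\n>hit\nGSA\n\x00>1\nMKT\n\x00"
for query_id, msa in sorted(split_a3m(blob).items()):
    print(query_id, msa.count(">"))
```

Keying by the header rather than by position is what makes the random ordering harmless.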

Same for the Templates M8: the order of each block of queries is random, you'll have something like:

3 TARGET1 ...
3 TARGET42 ...
1 TARGET123 ...
2 TARGET23 ...
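
The template hits can be regrouped the same way. This is a minimal sketch that assumes the first column of each M8 row is the query id, as in the example above; the helper name `group_m8_by_query` is hypothetical, and the remaining columns are kept as raw strings without interpretation.

```python
from collections import defaultdict
from typing import Dict, List


def group_m8_by_query(text: str) -> Dict[str, List[List[str]]]:
    """Group M8 rows by their first column (assumed to be the query id).

    Rows keep their original relative order within each query; the
    remaining columns are returned unparsed.
    """
    hits = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()  # handles tab- or space-separated rows
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hits[fields[0]].append(fields[1:])
    return dict(hits)


# Usage with toy rows in the same shuffled order as the example above.
m8_text = "3\tTARGET1\n3\tTARGET42\n1\tTARGET123\n2\tTARGET23\n"
for query_id, rows in sorted(group_m8_by_query(m8_text).items()):
    print(query_id, [row[0] for row in rows])
```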

@xinformatics
Author

Hi, thank you so much for your help. I am thinking about calculating the MSA separately for each of my sequences and then using them as the input to 'custom MSA'. Could you please share your thoughts on this? I do not wish to cause problems for other users.

@xinformatics
Author

Hi @sokrypton @milot-mirdita, I figured out the aforementioned issue. However, I would now like to extract the representations learned by RoseTTAFold. Any ideas on how I can extract them? Thanks.

@shozebhaider

> That being said, we could provide non-google-colab/non-notebook examples for production runs.

Is there an example of this that illustrates how the FASTA file should be formatted for a homo-/hetero-oligomer? Also, is running it with stand-alone AF any different from conventional runs?

martin-steinegger added a commit that referenced this issue Nov 10, 2021
Add --stop-at-score, --model-order parameter