For security reasons, if you want to deploy this solution, please submit your 12 digit AWS account ID and deployment region through Github issue.

Current support Alphafold Version : 2.3.2

Releases : https://github.com/deepmind/alphafold/releases/tag/v2.3.2 CDK version support check for 2.81.0 (build bd920f2)

Alphafold2 on AWS Deploy Guide:

AWS Blog：https://aws.amazon.com/cn/blogs/china/one-click-construction-of-a-highly-available-protein-structure-prediction-platform-on-the-cloud-part-one/

Architecture diagram on AWS：

https://github.com/wttat/af2-batch-cdk/blob/main/architecture.png

Modified Alphafold2 source code GitHub Repo：

https://github.com/wttat/alphafold

For IAM policy when deployed please check policy.json

Deployed steps. Based on Amazon Linux 2 AMI

Install Git.

sudo yum install -y git

Clone Repo.

git clone https://github.com/wttat/af2-batch-cdk
cd af2-batch-cdk

Modify af2_batch_cdk/batch.py if needed, check details below.
Run ./run.sh
Confirm the SNS email to receive follow-up notification.
There will be an c5.9xlarge EC2 launched to download all dataset and images and save to Fsx for lustre. After everything prepared(about 3h), you will received a email notification, you could terminate the EC2 and begin to submit alphafold2 job2 via API.
Modify the sampleParameters.json, check details below.

Manually Settings

app.py.
- Line 19/20. VPC settings:
  - Create a new vpc(Default, Highly recommended not to change):
```
use_default_vpc=0
vpc_id=""
```
  - Use default vpc:
```
use_default_vpc=1
vpc_id=""
```
  - Use a existed vpc:
```
use_default_vpc=0
vpc_id="{vpc_id}"
```
  Note: if you set use_default_vpc=1 and vpc_id at the same time, use_default_vpc will override vpc_id and use default vpc.
- Line 75-84. Nice DCV instance(Only tested in AWS China region).
  
  Uncomment to use Nice DCV to visualize output pdb files.
af2_batch_cdk/batch.py
- Uncomment to use P4/G5 batch CE.
- Uncomment to use P4/G5 batch que.
  
  Note: You have to do this at the same time.

Parameter Description of sampleParameters.json

For Job Settings:

fasta.type:string,Required:

The name of protein to be predicted.
file_name.type:string,Required:

The name of fasta file which stored in S3's input folder.
que.type:string.options:{low/mid/high/p4/g4dn/g4dn12x/g5/g512x},Required:

The GPU instance to use,which indicated for [p3.2xlarge、p3.8xlarge,p3.16xlarge,g4dn.2xlarge,g4dn.12xlarge,g5.2xlarge,g5.12xlarge and p4d.24xlarge].p4/g5 needs manually uncomment。
comment.type:string

The comment for this job。
gpu.type:int.Default:[1]:

The number of GPU to use.When using p3 instances，each gpu means 8vcpu，60G mem,when using p4 instance，each gpu means 12vcpu，160G mem.

If the fasta contains over 1000aa, one V100 is not enough. But it's a waste to use too many gpus, because the jax cant really take advantage of multi-gpu.

For Alphfold2 Settings:

db_preset.type:string.options:{reduced_dbs/full_dbs},Default:full_dbs

reduced_dbs: This preset is optimized for speed and lower hardware requirements. It runs with a reduced version of the BFD database. It requires 8 CPU cores (vCPUs), 8 GB of RAM, and 600 GB of disk space.

full_dbs: This runs with all genetic databases used at CASP14.
model_preset.type:string.options:{monomer/monomer_casp14/monomer_ptm/multimer} Required:

monomer:This is the original model used at CASP14 with no ensembling.

monomer_casp14: This is the original model used at CASP14 with num_ensemble=8, matching our CASP14 configuration. This is largely provided for reproducibility as it is 8x more computationally expensive for limited accuracy gain (+0.1 average GDT gain on CASP14 domains).

monomer_ptm: This is the original CASP14 model fine tuned with the pTM head, providing a pairwise confidence measure. It is slightly less accurate than the normal monomer model.

multimer: This is the AlphaFold-Multimer (https://github.com/deepmind/alphafold#citing-this-work) model. To use this model, provide a multi-sequence FASTA file. In addition, the UniProt database should have been downloaded.
models_to_relax.type:string.options:{best/all/none},Default:best

By default, only the best model (by pLDDT) is relaxed (--models_to_relax=best), but also all of the models (--models_to_relax=all) or none of the models (--models_to_relax=none) can be relaxed.
num_multimer_predictions_per_model.type:int.Default:[5]:

Controls how many predictions will be made per model, by default the offline system will run each model 5 times for a total of 25 predictions.
max_template_date.type:string(YYYY-MM-DD),Default:[2021-11-01]:

If you are predicting the structure of a protein that is already in PDB and you wish to avoid using it as a template, then max_template_date must be set to be before the release date of the structure

API for Alphafold2

Upload the fasta file to the input folder in the S3 bucket just created. Check the S3 bucket arn in the cdk output.
Check the API Gateway's URL in the cdk output or via AWS console.
Option 1:Log into the Sagemaker Notebook instance just created to do this by jupyter notebook. Notebook path: /af2-batch-cdk/notebook/Alphafold2.ipynb. Preview :https://github.com/wttat/af2-batch-cdk/blob/main/notebook/Alphafold2.ipynb
Option 2:Use Postman to do this more convenient, please check it in the AWS blog on the top of this README.
Option 3:Use Terminal or integrate below APIs into existing business systems, please check it in the AWS blog on the top of this README.

POST:Submit a job using POST method,change the KEY(if set) and ApiGW_URL to your own.

curl -X "POST" -H "Authorization: {KEY}" -H "Content-Type: application/json" {ApiGW_URL} -d @sampleParameters.json

GET ALL:Query all jobs, change the KEY(if set) and ApiGW_URL to your own:

curl -H "Authorization: {KEY}" {ApiGW_URL}

GET ONE:Query one job, change the KEY(if set) and ApiGW_URL to your own,the id of job could be searched via GET ALL or POST:

curl -H "Authorization: {KEY}" {ApiGW_URL}/{id}

CANCEL ALL:Cancel all running jobs,change the KEY(if set) and ApiGW_URL to your own.

curl -X "CANCEL" -H "Authorization: {KEY}"  {ApiGW_URL}

CANCEL ONE:Cancel running job,change the KEY(if set) and ApiGW_URL to your own,the id of job could be searched via GET ALL or POST:

curl -X "CANCEL" -H "Authorization: {KEY}"  {ApiGW_URL}/{id}

DELETE ALL:Delete all finished or failed jobs,change the KEY(if set) and ApiGW_URL to your own.

curl -X "DELETE" -H "Authorization: {KEY}"  {ApiGW_URL}

DELETE ONE:DELETE finished or failed job,change the KEY(if set) and ApiGW_URL to your own,the id of job could be searched via GET ALL or POST:

curl -X "DELETE" -H "Authorization: {KEY}"  {ApiGW_URL}/{id}

Enjoy!

Total cost calculate：

The cost filed in DynamoDB counts the number of seconds each task runs.
Use tag {AWS-GCR-HCLS-Solutions:Alphafold2_v2.3.2_03} to track total cost. Check:https://docs.aws.amazon.com/zh_cn/awsaccountbilling/latest/aboutv2/activating-tags.html

Current dataset version：

dataset5.tar.gz

update params to alphafold_params_2022-12-06.tar.
dataset4v2.tar.gz

update params to alphafold_params_2022-03-02.tar.
dataset3.tar.gz

update params to alphafold_params_2022-01-19.tar.
dataset2.tar.gz

update the dataset and params used by multimer.
dataset.tar.gz

original version.

Changelog

05/30/2023

Support Alpfadold v2.3.2,api changes from run_relax to models_to_relax

05/15/2023

Update aws-cdk to @2.79.1
Add g4dn.2xlarge/g4dn.12xlarge/g5.2xlarge/g5.12xlarge instance for new que g4dn/g4dn12x/g5/g512x.

10/17/2022

Update aws-cdk to @2.46.0

10/10/2022

Pretty the API response.
Update aws-cdk to @1.176.0

07/12/2022

Add alphafold2 job error check.

06/22/2022

Support Alpfadold v2.2.2,minor update,do not need upgrade.

04/26/2022

You have to require s3 permissions first to deploy this solution.
Update readme.
Update architecture diagram.

04/25/2022

Use sagemaker notebook to submit job

04/18/2022

Add SNS notification for job starting.
Update lambda code.
Add check max_template_date.

04/13/2022

Update job status from starting/running/failed/allset to match the Batch job status Initializing_SQS/Initializing_Batch/SUBMITTED/PENDING/RUNNABLE/STARTING/RUNNING/SUCCEEDED/FAILED
Update env setting, now you don't have to edit app.py.
Update resource naming.
Update aws-cdk to @1.152.0

04/10/2022

Fix html s3 pre-signed url expire time.

04/07/2022

Support Alpfadold v2.2.0, therefore the 'is_prokaryote_list' parameter has changed to 'num_multimer_predictions_per_model'.
Update aws-cdk to @1.151.0
Change fsx DeploymentType from PERSISTENT_1 to SCRATCH_2.
Add Resource Cleanup guide.
Update Readme.
Force s3 BlockPublicAccess policy to BLOCK_ALL.

03/31/2022

Update aws-cdk to @1.150.0

02/16/2022

Tag all resources by {AWS-GCR-HCLS-Solutions:Alphafold2} to track overall costs.

02/13/2022

Update to cdk@1.144.0, will update to cdk@v2 later.

02/12/2022

Update policy json.

02/01/2022

Fix install script.

02/07/2022

Update to support run_relax feature.

01/20/2022

Update params.

01/19/2022

Fix some outdated api.

01/17/2022

Support Alpfadold v2.1.1 to predict multimer and params files.
Change DynamoDB's default setting from provisioned to on-demand.

01/15/2022

Change to use all az in the region to make full use of GPU resources.
Therefore, the error that no GPU instance in the current AZ has been fixed. Of course, there is no way to do it if the entire region does not.

10/24/2021

Support p4 instances/que，this need manually operate。
Perfect the sns notification information, now you can see the task name and cost time directly from the mail.
S3 presigned URL expire in 1day now.

TODO

Better authentication mechanism.
Auto check p4.
Use Code pipeline to update images.
Frontend pages.

Resource Cleanup

Manually delete the image in the ECR Repo, then run:

cdk destroy --all

Known Issues

If the dataset download is completed really soon when manually selecting the vpc, it may be because the Fsx for lustre is not mounted correctly due to the VPC DNS/DHCP settings. You could ssh into tmp ec2 to manually execute the mount command to test the reason,or just create a new VPC to avoid such questions.check： https://docs.amazonaws.cn/fsx/latest/LustreGuide/troubleshooting.html
jax seems to have a problem with multi GPU scheduling, recommends a maximum of 2GPU.
CDK Bug.EventBridge's tag cannot be created and need to be added manually.

Citing this work

If you use the code or data in this package, please cite:

@Article{AlphaFold2021,
  author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
  journal = {Nature},
  title   = {Highly accurate protein structure prediction with {AlphaFold}},
  year    = {2021},
  volume  = {596},
  number  = {7873},
  pages   = {583--589},
  doi     = {10.1038/s41586-021-03819-2}
}

In addition, if you use the AlphaFold-Multimer mode, please cite:

@article {AlphaFold-Multimer2021,
  author       = {Evans, Richard and O{\textquoteright}Neill, Michael and Pritzel, Alexander and Antropova, Natasha and Senior, Andrew and Green, Tim and {\v{Z}}{\'\i}dek, Augustin and Bates, Russ and Blackwell, Sam and Yim, Jason and Ronneberger, Olaf and Bodenstein, Sebastian and Zielinski, Michal and Bridgland, Alex and Potapenko, Anna and Cowie, Andrew and Tunyasuvunakool, Kathryn and Jain, Rishub and Clancy, Ellen and Kohli, Pushmeet and Jumper, John and Hassabis, Demis},
  journal      = {bioRxiv}
  title        = {Protein complex prediction with AlphaFold-Multimer},
  year         = {2021},
  elocation-id = {2021.10.04.463034},
  doi          = {10.1101/2021.10.04.463034},
  URL          = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034},
  eprint       = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034.full.pdf},
}

Community contributions

Colab notebooks provided by the community (please note that these notebooks may vary from our full AlphaFold system and we did not validate their accuracy):

The ColabFold AlphaFold2 notebook by Martin Steinegger, Sergey Ovchinnikov and Milot Mirdita, which uses an API hosted at the Södinglab based on the MMseqs2 server (Mirdita et al. 2019, Bioinformatics) for the multiple sequence alignment creation.

Acknowledgements

AlphaFold communicates with and/or references the following separate libraries and packages:

We thank all their contributors and maintainers!

License and Disclaimer

This is not an officially supported Google product.

AlphaFold Code License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Model Parameters License

The AlphaFold parameters are made available under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find details at: https://creativecommons.org/licenses/by/4.0/legalcode

Third-party software

Use of the third-party software, libraries or code referred to in the Acknowledgements section above may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.

Mirrored Databases

The following databases have been mirrored by DeepMind, and are available with reference to the following:

BFD (unmodified), by Steinegger M. and Söding J., available under a Creative Commons Attribution-ShareAlike 4.0 International License.
BFD (modified), by Steinegger M. and Söding J., modified by DeepMind, available under a Creative Commons Attribution-ShareAlike 4.0 International License. See the Methods section of the AlphaFold proteome paper for details.
Uniclust30: v2018_08 (unmodified), by Mirdita M. et al., available under a Creative Commons Attribution-ShareAlike 4.0 International License.
MGnify: v2018_12 (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

Name		Name	Last commit message	Last commit date
Latest commit History 262 Commits
.vscode		.vscode
af2_batch_cdk		af2_batch_cdk
lambda		lambda
notebook		notebook
user_data		user_data
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
app.py		app.py
app___vsctk___template.yaml		app___vsctk___template.yaml
architecture.png		architecture.png
cdk.json		cdk.json
policy.json		policy.json
requirements.txt		requirements.txt
run.sh		run.sh
sampleParameters.json		sampleParameters.json

wttat/af2-batch-cdk

Folders and files

Latest commit

History

Repository files navigation