Collection of scripts used for analyses in: Koestlbacher et al., 2025. Inference of eukaryotic complexity in Asgard archaea using structural modeling
This repository contains a collection of scripts and tools used for various analyses as described in the paper "Structure-based inference of eukaryotic complexity in Asgard archaea."
- MMseqs2 integration for genetic search.
- Adapted ColabFold genetic search script with support for external databases.
- Various utilities for data analysis and visualization.
To set up the necessary conda environments, run:
conda create -n mmseqs2_v14 -c bioconda mmseqs2=14.7e284
conda create -n foldseek -c bioconda foldseek=6.29e2557=pl5321hb365157_2
# ColabFold has to be installed via pip
conda create -n colabfold
conda activate colabfold
pip install alphafold-colabfold==2.1.16
conda deactivate
Typical install time: ~10 minutes.
Ensure you have the following software versions:
- MMseqs2 for ColabFold search: fd1837b600c57278bcfb2ac1ac7f024e458c0606
Typical install time: ~5 minutes.
Due to hardware constraints, we do not yet support full reproduction of all manuscript figures. Homology search for structure prediction was run on nodes with 2x AMD Rome 7H12 (1 TiB DRAM, needed to load the ColabFold sequence databases into memory), and structure prediction was run on nodes with 4x NVIDIA A100 (40 GiB HBM2 memory with 5 active memory stacks per GPU) on the Dutch supercomputer Snellius (https://www.surf.nl/en/services/compute/snellius-the-national-supercomputer). However, we include all scripts and settings necessary to run the analysis pipeline and generate key summary outputs.
Change into the respective subdirectory and run the scripts in ascending order.
-
- This script submits MMseqs2 jobs for clustering sequences.
-
- An R script for mapping the clusters obtained from MMseqs2.
-
- This script extracts and aligns domain sequences from the clustered data.
-
- Clusters any leftover sequences that were not included in the initial clustering.
-
- An R script to select representative sequences from each cluster.
-
- This script performs an HHsearch against the COG database to annotate the clusters.
-
- This script extracts representatives for asCOGs and de novo clusters.
-
- This script submits ESMFold jobs to a cluster. It splits the input protein sequence files into smaller chunks and submits each chunk as a separate ESMFold prediction job.
-
- This script summarizes the results of ESMFold predictions by calculating the average pLDDT score for each protein structure.
-
03_generate_colabfold_input.sh
- This script generates input files for ColabFold based on the summarized ESMFold results. It filters sequences with an average pLDDT score below a certain threshold and prepares them for further analysis.
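As a rough illustration of the ESMFold summary and filtering steps above, the sketch below averages per-residue pLDDT over the C-alpha atoms of a PDB file and lists the sequences that fall below a cutoff. It assumes that pLDDT is stored in the B-factor column of the predicted PDB files and that the low-scoring sequences are the ones passed on to ColabFold. The function names and the cutoff of 70 are hypothetical and are not taken from the repository scripts.

```python
from statistics import mean


def mean_plddt(pdb_text):
    """Average the B-factor column (assumed to hold pLDDT) over C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name in 13-16, B-factor in 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return mean(scores)


def select_for_colabfold(plddt_by_id, threshold=70.0):
    """Return the IDs whose average pLDDT is below the (hypothetical) threshold."""
    return sorted(seq_id for seq_id, score in plddt_by_id.items() if score < threshold)
```

In practice `mean_plddt` would be called once per predicted structure, and the resulting dictionary of averages handed to `select_for_colabfold`.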
-
- This script processes input protein sequence files, categorizes them based on sequence length, and creates compressed chunks for further analysis.
-
- This script submits genetic search jobs to a cluster, using MMseqs2 for sequence alignment and searching against predefined databases. It loads the ColabFold databases into memory and therefore needs a machine with 800+ GB of memory. It also manages result packaging and compression.
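The length-based categorization and chunking described above can be sketched as follows. This is only a minimal illustration: the bin edges, the chunk size, and all function names are hypothetical, and compression of the resulting chunks is left out.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:].strip(), []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def length_bin(length, edges=(200, 500, 1000)):
    """Name the first bin whose upper edge is >= length (edges are illustrative)."""
    for edge in edges:
        if length <= edge:
            return f"le{edge}"
    return f"gt{edges[-1]}"


def chunk_by_length(records, chunk_size=100, edges=(200, 500, 1000)):
    """Group records by length bin, then split each bin into fixed-size chunks."""
    bins = defaultdict(list)
    for header, seq in records:
        bins[length_bin(len(seq), edges)].append((header, seq))
    return {
        name: [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
        for name, items in bins.items()
    }
```

Grouping sequences of similar length into the same chunk keeps the per-job runtime and memory use more predictable, which is one plausible motivation for this step.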
-
- Prepares queries for prediction by extracting and processing input files.
- Activates the required Conda environment and runs colabfold_split_msas to generate MSA files.
-
- Submits prediction jobs to a SLURM scheduler.
- Activates the required Conda environment and runs colabfold_batch for predictions.
-
- Summarizes prediction results by extracting and processing prediction logs and PDB files.
- Calculates average pLDDT scores for predicted structures and selects the best structures.
-
- Identifies and selects sequences that were not successfully predicted.
- Prepares a list of missing sequences for further processing or prediction attempts.
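The selection of the best structure per sequence and the detection of missing predictions, as described above, could look roughly like this. The input layout (tuples of sequence ID, model name, and average pLDDT) and the function names are assumptions made for illustration; the repository scripts derive these values from prediction logs and PDB files.

```python
def best_models(scores):
    """Keep, for every sequence ID, the model with the highest average pLDDT.

    `scores` is an iterable of (seq_id, model_name, mean_plddt) tuples.
    """
    best = {}
    for seq_id, model, plddt in scores:
        if seq_id not in best or plddt > best[seq_id][1]:
            best[seq_id] = (model, plddt)
    return best


def missing_ids(expected, predicted):
    """Return the expected sequence IDs that have no prediction, in input order."""
    predicted = set(predicted)
    return [seq_id for seq_id in expected if seq_id not in predicted]
```

The list returned by `missing_ids` corresponds to the sequences that would be queued for another prediction attempt.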
-
- Creates necessary directories (data, results, log).
- Activates the foldseek Conda environment.
- Prepares a database from PDB files located in ../04_prediction/results/Asgard_best_structures/.
-
- Configures a SLURM job for structure search.
- Activates the foldseek Conda environment.
- Iterates over multiple databases to perform foldseek searches and converts alignments to m8 format.
- Outputs results to the results directory.
-
- Uses the bin/best_hits.R script to extract non-overlapping best hits per query in both directions.
- Outputs results to the results directory.
-
- Activates the foldseek environment (commented out in the script).
- Performs foldseek search on the Asgard database against itself.
- Generates TSV files for alignments and clusters, storing results in the results directory.
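To make the idea of non-overlapping best hits per query concrete, here is a minimal sketch of one possible greedy approach. It is not a port of bin/best_hits.R, whose exact logic is not described here. It assumes the default 12-column m8 layout (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits), ranks hits by bit score, and keeps a hit only if its query interval does not overlap a hit already kept for the same query. All function names are hypothetical.

```python
def parse_m8(text):
    """Parse tab-separated m8 lines into dictionaries with the fields used below."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        f = line.split("\t")
        hits.append({
            "query": f[0], "target": f[1],
            "qstart": int(f[6]), "qend": int(f[7]),
            "evalue": float(f[10]), "bits": float(f[11]),
        })
    return hits


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def best_nonoverlapping_hits(hits):
    """Greedily keep the highest-scoring hits whose query intervals do not overlap."""
    kept = {}
    for hit in sorted(hits, key=lambda h: h["bits"], reverse=True):
        chosen = kept.setdefault(hit["query"], [])
        if not any(overlaps(hit["qstart"], hit["qend"], c["qstart"], c["qend"]) for c in chosen):
            chosen.append(hit)
    return kept
```

Running the same function on the results of the reverse search would give the second direction.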
-
- Downloads the UniRef50 XML file from the UniProt database.
- Saves the file in the data directory.
-
- Extracts relevant information from the UniRef50 XML file.
- Generates a long-format TSV file (uniref50.sub.long.tsv) for downstream analysis.
-
- Runs the enrichment analysis using the run_euk_enrichment.R script.
- Requires input files generated in the previous steps and outputs results in the results directory.
-
- A helper script used in 01_run_subselect.sh to process UniRef50 data into a long format.
-
- The main R script for performing statistical enrichment analysis.
- Outputs summary tables and adjusted p-values for enrichment results.
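The statistical test used in run_euk_enrichment.R is not specified in this README. As a generic illustration of how enrichment p-values and their adjusted counterparts can be computed, the sketch below uses a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment. Both choices are assumptions, as are the function names, and the R script may use a different test or correction.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1))
    return hits / total


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Here `k` would be the number of items in a category within the set of interest, `K` the number of such items in the background of size `N`, and `n` the size of the set of interest.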
- generate_input_data_iTol.py
- Generates datasets to annotate and visualize single gene trees in iTOL.
- See its README.md for further details.
All structural annotation results and enrichment outputs are available at: Figshare (full datasets; DOI provided upon manuscript acceptance)
This project is licensed under the MIT License.