uio-bmi/compairr-benchmarking

CompAIRR benchmarking

This GitHub repository contains supplementary code used to benchmark the AIRR overlap functionality of CompAIRR [1] against VDJtools [2], immunarch [3] and immuneREF [4].

The tools were benchmarked on synthetic human TCR-beta repertoires generated by OLGA [5]. Both the number of repertoires and the repertoire size (number of sequences per repertoire) were varied. Unlike the other tools, CompAIRR also supports non-exact sequence matching, so different CompAIRR settings (number of sequence mismatches and multithreading) were benchmarked as well. Lastly, the duration of the individual CompAIRR analysis steps was tracked and compared across settings.
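
To make the idea of non-exact matching concrete, here is a minimal Python sketch that counts matching sequence pairs between two toy repertoires when up to `d` substitutions are allowed. It is a naive all-against-all comparison for illustration only, not CompAIRR's algorithm, and it ignores V/J genes and sequence counts. The function names and the example sequences are assumptions made for this sketch.

```python
def hamming_within(a: str, b: str, d: int) -> bool:
    """Return True if a and b have equal length and differ in at most d positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > d:
                return False
    return True


def count_matches(rep_a, rep_b, d=0):
    """Count pairs (one sequence from each repertoire) matching within d substitutions."""
    return sum(hamming_within(a, b, d) for a in rep_a for b in rep_b)


# Toy repertoires (illustrative sequences, not benchmark data)
rep_a = ["CASSLGQETQYF", "CASSPDRGYEQYF"]
rep_b = ["CASSLGQETQYF", "CASSLGQDTQYF", "CASRPDRGYEQYF"]

print(count_matches(rep_a, rep_b, d=0))  # exact matching: 1
print(count_matches(rep_a, rep_b, d=1))  # one mismatch allowed: 3
```

Allowing one mismatch raises the overlap from 1 to 3 pairs in this toy case, which is why the mismatch setting is treated as a separate benchmarking dimension.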

The contents of this repository are as follows:

  • scripts contains all scripts used to generate data for the benchmarking, perform the benchmarking, and create figures
  • example_data contains a small dataset (results of steps 1 and 2) to test run the tools in step 3
  • results contains the original benchmarking results files (time + memory usage, CompAIRR log files) which were used as an input to create figures (step 4)

Benchmarking was performed on a server with two AMD EPYC 7742 64-core processors and 2015 GB RAM running Red Hat Enterprise Linux 8.4.

Benchmarking instructions

Step 1: Generate sequences using OLGA

See the OLGA GitHub page for installation instructions.

Once OLGA [5] is installed, the script generate_olga_sequences.sh can be used to generate synthetic immune repertoires. This script creates one subfolder per repertoire size, named after that size (e.g., 1e3, 1e4, 1e5). The subsequent scripts rely on this subfolder naming.
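
The folder layout described above can be sketched in Python as follows. This is not the contents of generate_olga_sequences.sh: the helper names and the file name pattern `rep_<i>.tsv` are hypothetical, and the `olga-generate_sequences --humanTRB -n ... -o ...` invocation is assumed to follow OLGA's documented command-line usage. The sketch only builds the command lines and does not run OLGA.

```python
from pathlib import Path


def size_label(n_sequences: int) -> str:
    """Format a power-of-ten repertoire size as a folder name, e.g. 1000 -> '1e3'."""
    exponent = len(str(n_sequences)) - 1
    if n_sequences != 10 ** exponent:
        raise ValueError("this sketch only handles powers of ten")
    return f"1e{exponent}"


def olga_commands(out_dir, sizes, n_repertoires):
    """Build one OLGA command per repertoire, grouped into one subfolder per size."""
    commands = []
    for size in sizes:
        subfolder = Path(out_dir) / size_label(size)
        for i in range(1, n_repertoires + 1):
            out_file = subfolder / f"rep_{i}.tsv"  # hypothetical file name
            commands.append(
                ["olga-generate_sequences", "--humanTRB",
                 "-n", str(size), "-o", str(out_file)]
            )
    return commands


cmds = olga_commands("olga", sizes=[1000, 10000], n_repertoires=2)
for cmd in cmds:
    print(" ".join(cmd))
```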

Step 2: Convert OLGA data to specific formats

The Python script convert_formats.py can be used to convert the OLGA-generated repertoire files to the specific input formats for each of the tools. This script requires Python 3.7 or higher and the Python package pandas. The main OLGA output folder (which contains the subfolders 1e3, 1e4, 1e5, etc.) should be used as the input folder (-i) for this script. It must contain all the specified subfolders (-s), and each of these subfolders must contain at least the specified number of repertoires (-n).

Example usage:

    python3 convert_formats.py -i example_data/olga -o example_data/formatted -f compairr vdjtools immunarch immuneref -n 10 100 -s 1e2
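
The input-folder requirements for this step can be checked up front with a small helper. The following is a minimal sketch, not part of convert_formats.py: the function name is hypothetical, and it assumes that when several values are passed to -n, each subfolder needs at least the largest of them.

```python
import tempfile
from pathlib import Path


def check_input_folder(input_dir, subfolders, n_repertoires):
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []
    needed = max(n_repertoires)  # assumption: the largest -n value must be satisfied
    for name in subfolders:
        sub = Path(input_dir) / name
        if not sub.is_dir():
            problems.append(f"missing subfolder: {name}")
            continue
        n_files = sum(1 for p in sub.iterdir() if p.is_file())
        if n_files < needed:
            problems.append(f"{name}: found {n_files} files, need at least {needed}")
    return problems


# Demonstration on a temporary folder with one under-filled subfolder
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "1e2"
    sub.mkdir()
    for i in range(3):
        (sub / f"rep_{i}.tsv").touch()
    demo_problems = check_input_folder(tmp, ["1e2", "1e3"], [2, 5])
    print(demo_problems)
```

For the example command above, the equivalent call would be `check_input_folder("example_data/olga", ["1e2"], [10, 100])`.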

Step 3: Run each of the tools

Several Bash scripts were created to run each of the tools with varying numbers of repertoires, repertoire sizes and repetitions. These scripts call the GNU time command (/usr/bin/time) to track the time (user, system, elapsed) and memory usage (maxrss) of the tools, as well as other statistics. The output of this command will be parsed in step 4 to create figures. The benchmarking can be done with a different time command if GNU time is not supported, but then the scripts in step 4 need to be altered.
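
The benchmarking loop described above can be sketched in Python as follows. The repository's own scripts are written in Bash and may be organised differently. The statistics file naming and the placeholder tool command are hypothetical, and the use of the verbose `-v` output (rather than a custom `-f` format) is an assumption. `-v` and `-o` are standard GNU time options.

```python
from itertools import product


def timed_commands(tool_cmd, n_repertoires, sizes, repetitions, stats_dir="stats"):
    """Wrap a tool command in GNU time for every benchmark configuration."""
    wrapped = []
    for n, size, rep in product(n_repertoires, sizes, range(1, repetitions + 1)):
        stats_file = f"{stats_dir}/n{n}_s{size}_r{rep}.txt"  # hypothetical naming
        # -v prints verbose statistics, -o writes them to a file instead of stderr
        wrapped.append(["/usr/bin/time", "-v", "-o", stats_file] + tool_cmd(n, size))
    return wrapped


def example_tool(n, size):
    # Placeholder command; a real run would call CompAIRR, VDJtools, etc.
    return ["echo", f"{n} repertoires of size {size}"]


commands = timed_commands(example_tool, [10, 100], ["1e2", "1e3"], repetitions=3)
print(len(commands))          # 2 repertoire counts x 2 sizes x 3 repetitions = 12
print(" ".join(commands[0]))
```

Each resulting list could be passed to `subprocess.run` on a system where GNU time is installed at /usr/bin/time.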

CompAIRR

Install CompAIRR [1] using the installation instructions described on the CompAIRR GitHub page. CompAIRR version 1.3.1 was used in our benchmarking.

The Bash script compairr_benchmark.sh can be used to run CompAIRR without indels, and the number of differences (non-exact sequence matching) and number of threads can be configured. Additionally, compairr_benchmark_i.sh can be used to run CompAIRR with indels and 1 difference.
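
As a rough illustration of the settings these scripts vary, the sketch below assembles a CompAIRR overlap command line in Python. It is not taken from compairr_benchmark.sh. The option names (`--matrix`, `--differences`, `--indels`, `--threads`, `--output`, `--log`) are assumed from CompAIRR's documentation and should be checked against `compairr --help` for the installed version. Passing the same file twice to compare a repertoire set against itself is also an assumption, as are the function name and file names.

```python
def compairr_command(repertoire_file, output_file, differences=0, indels=False,
                     threads=1, log_file=None):
    """Build a CompAIRR overlap command line as a list of arguments."""
    if indels and differences != 1:
        # The benchmark only combines indels with exactly 1 difference
        raise ValueError("indels require differences == 1")
    cmd = ["compairr", "--matrix",
           "--differences", str(differences),
           "--threads", str(threads),
           "--output", output_file]
    if indels:
        cmd.append("--indels")
    if log_file is not None:
        cmd += ["--log", log_file]
    # Assumption: the same repertoire set is given as both inputs
    cmd += [repertoire_file, repertoire_file]
    return cmd


exact = compairr_command("reps.tsv", "overlap.tsv")
fuzzy = compairr_command("reps.tsv", "overlap.tsv", differences=1, indels=True,
                         threads=8, log_file="compairr.log")
print(" ".join(exact))
print(" ".join(fuzzy))
```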

VDJtools

VDJtools [2] version 1.2.1 was used in our benchmarking. The installation instructions can be found on the VDJtools GitHub page. The Bash script vdjtools_benchmark.sh can be used to call VDJtools.

immunarch and immuneREF

Both immunarch [3] and immuneREF [4] are R packages. Installation instructions can be found on the immunarch website and the immuneREF GitHub page. In our study, R version 3.6.1, immunarch version 0.6.5 and immuneREF version 0.5.0 were used. The R scripts run_immunarch.R and run_immuneref.R can be used to run immunarch and immuneREF as command line tools which read and write data from and to the given paths. These wrapper scripts are called by the Bash scripts immunarch_benchmark.sh and immuneref_benchmark.sh.

Step 4: Plot results

The output files created by the GNU time command and the CompAIRR log files contain the time and memory usage information that is used to generate figures. The Python scripts read_tool_benchmark_files.py and read_compairr_log_files.py should first be used to parse the GNU time files and the CompAIRR log files, respectively. The resulting output files tool_stats_agg.tsv and compairr_time_parts.tsv, together with the files in the folder emerson, are used by the R script figures.R to create the figures, which can also be found in the folder figures.
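
To illustrate the kind of parsing this step involves, the sketch below extracts time and memory fields from GNU time output. It is not the code of read_tool_benchmark_files.py. It assumes the verbose (`time -v`) output format, the function and key names are hypothetical, and the sample text contains made-up placeholder values rather than benchmarking results.

```python
import re

# Made-up sample in the layout printed by `/usr/bin/time -v` (not real results)
SAMPLE = (
    '\tCommand being timed: "some_tool input.tsv"\n'
    "\tUser time (seconds): 12.34\n"
    "\tSystem time (seconds): 0.56\n"
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.21\n"
    "\tMaximum resident set size (kbytes): 204800\n"
)

PATTERNS = {
    "user_s": r"User time \(seconds\): ([\d.]+)",
    "system_s": r"System time \(seconds\): ([\d.]+)",
    "elapsed": r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): ([\d:.]+)",
    "maxrss_kb": r"Maximum resident set size \(kbytes\): (\d+)",
}


def elapsed_to_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss.ss' elapsed time string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_output(text):
    """Extract user, system and elapsed time (s) and max RSS (kB) from the text."""
    raw = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"field not found: {key}")
        raw[key] = match.group(1)
    return {
        "user_s": float(raw["user_s"]),
        "system_s": float(raw["system_s"]),
        "elapsed_s": elapsed_to_seconds(raw["elapsed"]),
        "maxrss_kb": int(raw["maxrss_kb"]),
    }


stats = parse_time_output(SAMPLE)
print(stats)
```

If a different time command or output format is used, only the patterns would need to change, which mirrors the note in step 3 that the parsing scripts must be adapted in that case.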

References

1: Rognes T, Scheffer L, Greiff V, Sandve GK (2021) "CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching." bioRxiv: https://doi.org/10.1101/2021.10.30.466600

2: Shugay M, Bagaev DV, Turchaninova MA, Bolotin DA, Britanova OV, Putintseva EV, et al. (2015) "VDJtools: Unifying Post-analysis of T Cell Receptor Repertoires." PLoS Comput Biol 11(11): e1004503. https://doi.org/10.1371/journal.pcbi.1004503

3: Nazarov, Vadim I., Vasily O. Tsvetkov, and Eugene Rumynskiy. (2019) "Immunarch: An R Package for Painless Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires (version 0.6.5)." R. ImmunoMind. https://doi.org/10.5281/zenodo.3367200.

4: Cédric R. Weber et al. immuneREF: Reference-based similarity comparison of immune repertoires (in prep.) https://doi.org/10.5281/zenodo.5522406

5: Sethna, Zachary, Yuval Elhanati, Curtis G. Callan, Aleksandra M. Walczak, and Thierry Mora. (2019) "OLGA: Fast Computation of Generation Probabilities of B- and T-Cell Receptor Amino Acid Sequences and Motifs." Bioinformatics 35 (17): 2974–81. https://doi.org/10.1093/bioinformatics/btz035.
