CompAIRR benchmarking

This GitHub repository contains supplementary code used to benchmark the AIRR overlap functionality of CompAIRR¹ against VDJtools², immunarch³ and immuneREF⁴.

To benchmark the tools, synthetic human TCR-beta repertoires generated by OLGA⁵ were used. Both the number of repertoires and the repertoire size (number of sequences per repertoire) were varied. Unlike the other tools, CompAIRR also supports non-exact sequence matching, so different CompAIRR settings (number of sequence mismatches and multithreading) were benchmarked as well. Lastly, the duration of the individual CompAIRR analysis steps was tracked and compared across settings.

The contents of this repository are as follows:

scripts: all scripts used to generate data for the benchmarking, perform the benchmarking, and create figures
example_data: a small dataset (results of steps 1 and 2) for test-running the tools in step 3
results: the original benchmarking result files (time and memory usage, CompAIRR log files) that were used as input to create the figures (step 4)

Benchmarking was performed on a server with two AMD EPYC 7742 64-core processors and 2015 GB RAM running Red Hat Enterprise Linux 8.4.

Benchmarking instructions

Step 1: Generate sequences using OLGA

See the OLGA GitHub page for installation instructions.

Once OLGA⁵ is installed, the script generate_olga_sequences.sh can be used to generate synthetic immune repertoires. This script creates subfolders named after the repertoire sizes (e.g., 1e3, 1e4, 1e5). The subsequent scripts rely on this subfolder naming.
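
The folder naming convention can be illustrated with a minimal Python sketch. It is not part of generate_olga_sequences.sh; the function names and the rep_<index>.tsv file naming are hypothetical and only show how a label such as 1e3 maps to a sequence count and a subfolder.

```python
from pathlib import Path


def size_from_label(label):
    """Convert a folder label such as '1e3' into a sequence count (1000)."""
    value = float(label)
    if value <= 0 or value != int(value):
        raise ValueError(f"not a positive whole number: {label!r}")
    return int(value)


def plan_repertoire_files(root, labels, n_repertoires):
    """Map each planned repertoire file to the number of sequences it should hold."""
    plan = {}
    for label in labels:
        size = size_from_label(label)
        for index in range(1, n_repertoires + 1):
            # rep_<index>.tsv is a hypothetical name, not the script's actual naming.
            plan[Path(root) / label / f"rep_{index}.tsv"] = size
    return plan


plan = plan_repertoire_files("olga", ["1e3", "1e4"], n_repertoires=2)
for path, size in plan.items():
    print(path, size)
```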

Step 2: Convert OLGA data to specific formats

The Python script convert_formats.py can be used to convert the OLGA-generated repertoire files to the specific input format of each tool. This script requires Python 3.7 or higher and the Python package pandas. The main OLGA output folder (which contains the subfolders 1e3, 1e4, 1e5, etc.) should be used as the input folder (-i). It must contain all the specified subfolders (-s), and each of these subfolders must contain at least the specified number of repertoires (-n).

Example usage:

    python3 convert_formats.py -i example_data/olga -o example_data/formatted -f compairr vdjtools immunarch immuneref -n 10 100 -s 1e2
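
Before running the conversion, the input requirements described above can be checked with a short sketch like the following. It is not part of convert_formats.py. The function name is hypothetical, and it assumes that each repertoire is stored as one file directly inside its size subfolder and that, when several values are passed to -n, the largest one determines how many files are needed.

```python
from pathlib import Path


def check_input_layout(input_dir, subfolders, n_required):
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []
    for name in subfolders:
        folder = Path(input_dir) / name
        if not folder.is_dir():
            problems.append(f"missing subfolder: {name}")
            continue
        # Assumption: one repertoire per file, stored directly in the subfolder.
        n_found = sum(1 for entry in folder.iterdir() if entry.is_file())
        if n_found < n_required:
            problems.append(f"{name}: found {n_found} files, need {n_required}")
    return problems


# For the example call above (-n 10 100 -s 1e2) this would be:
# check_input_layout("example_data/olga", ["1e2"], max([10, 100]))
```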

Step 3: Run each of the tools

Several Bash scripts were created to run each of the tools with varying numbers of repertoires, repertoire sizes and repetitions. These scripts call the GNU time command (/usr/bin/time) to track the time (user, system, elapsed) and memory usage (maxrss) of the tools, as well as other statistics. The output of this command will be parsed in step 4 to create figures. The benchmarking can be done with a different time command if GNU time is not supported, but then the scripts in step 4 need to be altered.
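
To illustrate what parsing the GNU time output involves, here is a minimal sketch. It assumes the verbose report format (/usr/bin/time -v), which may differ from the format the benchmark scripts actually request, and it is not the parser used in step 4. The function names and output keys are hypothetical.

```python
def parse_elapsed(text):
    """Convert 'h:mm:ss' or 'm:ss.ff' into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_gnu_time(report):
    """Extract user, system and elapsed seconds plus peak memory from a verbose report."""
    wanted = {
        "User time (seconds)": ("user_s", float),
        "System time (seconds)": ("system_s", float),
        "Elapsed (wall clock) time (h:mm:ss or m:ss)": ("elapsed_s", parse_elapsed),
        "Maximum resident set size (kbytes)": ("maxrss_kb", int),
    }
    stats = {}
    for line in report.splitlines():
        # Split on the last ': ' because the elapsed-time label itself contains colons.
        key, sep, value = line.strip().rpartition(": ")
        if sep and key in wanted:
            name, convert = wanted[key]
            stats[name] = convert(value)
    return stats


# Made-up sample values, for illustration only.
sample = "\n".join([
    '\tCommand being timed: "sleep 1"',
    "\tUser time (seconds): 12.50",
    "\tSystem time (seconds): 1.25",
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 1:02.50",
    "\tMaximum resident set size (kbytes): 204800",
])
print(parse_gnu_time(sample))
```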

CompAIRR

Install CompAIRR¹ using the installation instructions described on the CompAIRR GitHub page. CompAIRR version 1.3.1 was used in our benchmarking.

The Bash script compairr_benchmark.sh runs CompAIRR without indels; the number of differences (non-exact sequence matching) and the number of threads can be configured. Additionally, compairr_benchmark_i.sh runs CompAIRR with indels and 1 difference.
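
The two configurations can be sketched as command lines built in Python. This is an illustration, not the content of the benchmark scripts. The option names (--matrix, --differences, --indels, --threads, --output, --log) are written as we understand them from the CompAIRR documentation and should be verified with compairr --help. The function name and all file names are hypothetical.

```python
import shlex


def compairr_command(set_a, set_b, output, differences=0, indels=False, threads=1, log=None):
    """Build an argument list for a CompAIRR overlap run (option names are assumptions)."""
    if indels and differences != 1:
        # Mirrors the setup described above: indels are benchmarked with 1 difference.
        raise ValueError("indels are only combined with exactly 1 difference here")
    command = [
        "compairr", "--matrix", set_a, set_b,
        "--differences", str(differences),
        "--threads", str(threads),
        "--output", output,
    ]
    if indels:
        command.append("--indels")
    if log:
        command += ["--log", log]
    return command


# Hypothetical file names, for illustration only.
without_indels = compairr_command("set_a.tsv", "set_b.tsv", "overlap.tsv", differences=2, threads=8)
with_indels = compairr_command("set_a.tsv", "set_b.tsv", "overlap_i.tsv", differences=1, indels=True)
for command in (without_indels, with_indels):
    print(" ".join(shlex.quote(argument) for argument in command))
```

Building the command as a list rather than a single string keeps each argument separate, so the list can be passed to subprocess.run without shell quoting issues.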

VDJtools

VDJtools² version 1.2.1 was used in our benchmarking. The installation instructions can be found on the VDJtools GitHub page. The Bash script vdjtools_benchmark.sh can be used to call VDJtools.

immunarch and immuneREF

Both immunarch³ and immuneREF⁴ are R packages. Installation instructions can be found on the immunarch website and the immuneREF GitHub page. In our study, R version 3.6.1, immunarch version 0.6.5 and immuneREF version 0.5.0 were used. The R scripts run_immunarch.R and run_immuneref.R can be used to run immunarch and immuneREF as command line tools that read data from and write data to the given paths. These wrapper scripts are called by the Bash scripts immunarch_benchmark.sh and immuneref_benchmark.sh.

Step 4: Plot results

The output files created by the GNU time command and the CompAIRR log files contain the time and memory usage information that is used to generate the figures. The Python scripts read_tool_benchmark_files.py and read_compairr_log_files.py should first be used to parse the GNU time files and the CompAIRR log files, respectively. The resulting output files tool_stats_agg.tsv and compairr_time_parts.tsv, as well as the files in the folder emerson, are used by the R script figures.R to create the figures, which can also be found in the folder figures.
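
As an illustration of the aggregation that precedes plotting, the sketch below averages one measurement over repeated runs. It is not the code of read_tool_benchmark_files.py, and the column names (tool, repertoire_size, n_repertoires, elapsed_s) are hypothetical and do not describe the actual layout of tool_stats_agg.tsv. The sample values are made up.

```python
import csv
import io
import statistics
from collections import defaultdict


def aggregate_runs(rows, value_column="elapsed_s"):
    """Average one measurement over repetitions, per tool and benchmark setting."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["tool"], row["repertoire_size"], row["n_repertoires"])
        groups[key].append(float(row[value_column]))
    return {key: statistics.mean(values) for key, values in groups.items()}


# Made-up per-run measurements in a hypothetical TSV layout.
sample_tsv = (
    "tool\trepertoire_size\tn_repertoires\telapsed_s\n"
    "compairr\t1e3\t10\t0.5\n"
    "compairr\t1e3\t10\t1.5\n"
    "vdjtools\t1e3\t10\t4.0\n"
    "vdjtools\t1e3\t10\t6.0\n"
)
rows = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
means = aggregate_runs(rows)
for key, mean_elapsed in means.items():
    print(*key, mean_elapsed, sep="\t")
```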

References

1: Rognes T, Scheffer L, Greiff V, Sandve GK (2021) "CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching." bioRxiv: https://doi.org/10.1101/2021.10.30.466600

2: Shugay M, Bagaev DV, Turchaninova MA, Bolotin DA, Britanova OV, Putintseva EV, et al. (2015) "VDJtools: Unifying Post-analysis of T Cell Receptor Repertoires." PLoS Comput Biol 11(11): e1004503. https://doi.org/10.1371/journal.pcbi.1004503

3: Nazarov, Vadim I., Vasily O. Tsvetkov, and Eugene Rumynskiy. (2019) "Immunarch: An R Package for Painless Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires (version 0.6.5)." R. ImmunoMind. https://doi.org/10.5281/zenodo.3367200.

4: Cédric R. Weber et al. immuneREF: Reference-based similarity comparison of immune repertoires (in prep.) https://doi.org/10.5281/zenodo.5522406

5: Sethna, Zachary, Yuval Elhanati, Curtis G. Callan, Aleksandra M. Walczak, and Thierry Mora. (2019) "OLGA: Fast Computation of Generation Probabilities of B- and T-Cell Receptor Amino Acid Sequences and Motifs." Bioinformatics 35 (17): 2974–81. https://doi.org/10.1093/bioinformatics/btz035.
