Skip to content

A Pipeline for the Design of Tile Region Exchange Mutagenesis (T-Rex) Harnessing Data-optimized Assembly Design (DAD)

License

Notifications You must be signed in to change notification settings

yunyicheng/trex-dad

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

59 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TRexDAD

A pipeline for the design of mutagenizing oligonucleotides required for Tile Region Exchange(T-Rex) Mutagenesis, which is a crucial step for deep mutation scanning (DMS), by harnessing data-optimized assembly design (DAD).

GitHub language count GitHub last commit (branch) GitHub commit activity (branch) GitHub issues Discord

Description

Deep Mutational Scanning (DMS) is a technique that combines mutagenesis, high-throughput sequencing, and functional assays to measure the effects of a large number of genetic variants on a protein’s function or phenotype. In order to assess variants, generating a library of mutants is necessary. Tile Region Exchange (T-Rex) Mutagenesis is a high-throughput method to generate all possible single nucleotide variants (SNVs) or missense mutations in a gene. Here is an overview of T-Rex Mutagenesis. TRex Process Overview However, deriving the assembly design by hand is often error-prone and time-consuming since the fidelity of resulting oligonucleotides does not merely depend on mere rules. Therefore, a data-driven solution is required for designing oligonucleotides with high correct assembly rate (Pryor et al., 2020). Although there are web tools currently available for similar mutagenesis procedures, few of them utilize Data-optimized Assembly Design.TRexDAD is a streamlined tool that facilitates oligonucleotide design for T-Rex by harnessing DAD techniques, along with visualization of the optimization process.

The package is for researchers who perform Tile Region Exchange Mutagenesis for some genes of interest and need an efficient and reliable streamline to produce mutagenizing oligonucleotides. The tool includes functions performing data processing, overhang fidelity score calculations, and optimization of gene tile locations, along with a visualization of the change in fidelity score of resulting overhangs from oligonucleotide designs. The scope of the R package is to improve the work flow in T-Rex Mutagenesis regarding efficiency and accuracy, facilitating open-source researches for interested bioinformatitians and computational biologists. The TRexDAD package was developed using R version 4.3.1 (2023-06-16), Platform: aarch64-apple-darwin20 (64-bit) and Running under: macOS Sonoma 14.1.1.

Installation

To install the latest version of the package:

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("devtools")
library("devtools")
devtools::install_github("yunyicheng/TRexDAD", build_vignettes = TRUE)
library("TRexDAD")

To run the Shiny app:

TRexDAD::run_TRexDAD()

Package Overview

ls("package:TRexDAD")
browseVignettes("TRexDAD")

TRexDAD contains 11 functions. Since 10 of them are parts of an integrated pipeline, they are go to one R script, pipeline.R. The function run_TRexDAD is a runner for shiny App, so it goes into run_TRexDAD.R. Here are the workflow illustration and detailed explanation of the functions in pipeline.R.

Pipeline Workflow

Pipeline Workflow
  1. execute_and_plot is the work flow function which wraps up most of the other functions. It performs an optimization process to determine the optimal tile positions in a gene sequence and plots the progression of the score over iterations. It provides two customization parameters: iteration_max and scan_rate. If not other specified, these parameter will have default values: iteration_max=30, scan_rate=7.

  2. split_into_codons takes a gene sequence (as a string) and splits it into codons.

  3. oligo_cost calculates the cost of oligonucleotides (oligos) based on the number of tiles and the number of codons.

  4. calculate_optimal_tiles calculates the oligo cost for a range of tile numbers and finds the number of tiles that minimizes this cost. It also computes the global length of the tiles.

  5. get_overhangs calculates the overhang sequences at the specified positions in a gene sequence. Depending on the flag, it returns these sequences as either DNAString objects or plain character strings.

  6. get_all_overhangs computes overhang sequences for a given list of positions in a gene sequence. It iteratively calls the get_overhangs function to calculate the head and tail overhangs for each position and accumulates them in a list.

  7. calculate_local_score calculates a score for a specific tile within a gene sequence. The score is based on palindromicity, length variation from a global standard, and on-target reactivity, using a pre-defined overhang fidelity data frame.

  8. calculate_global_score calculates a global score for a list of positions in a gene sequence. The score takes into account off-target reactions, repetitions, and the sum of local scores (by calling obtain_score).

  9. pick_position selects a position from a list of positions for optimization based on their scores. It uses a weighted sampling approach where the weights are inversely proportional to the scores of the positions.

  10. optimize_position optimizes a single position within a list of positions. It adjusts the specified position to maximize the overall score (obtained via calculate_scores). The function supports different optimization modes, including greedy and Markov Chain Monte Carlo (MCMC) approaches.

Contributions

The author of the package is Yunyi Cheng.

The author wrote the execute_and_plot function, which performs an optimization process to determine the optimal tile positions in a gene sequence and plots the change in the score over iterations.

Other user-accessible functions are also developed by the author. Most of them serve as helper functions for the execute_and_plot function, but they can also be used indepently for specific tasks.

Biostrings R package is used for calculating reverse complement of overhangs in the get_overhangs function.

forstringr R package is used for manipulating splitted target gene (a list of codons) in the get_overhangs function.

ggplot2 and scales R packages are used in producing plot for the optimization progress. It is used in execute_and_plot and the vignette.

shiny and shinythemes R packages are used in app.R for shiny app development and formatting.

The utils R package (included in R) is used for producing combinations of overhangs and reading excel file overhang_fidelity.csv. It is used in
calculate_global_score function and the first section of pipeline.R.

overhang_fidelity.csv is from the study by New England Biolabs (Pryor et al., 2020).

References

  • Pryor JM, Potapov V, Kucera RB, Bilotti K, Cantor EJ, Lohman GJS (2020). Enabling one-pot Golden Gate assemblies of unprecedented complexity using data-optimized assembly design. PLOS ONE, 15(9), e0238592. doi:10.1371/journal.pone.0238592

  • Potapov V, Ong JL, Kucera RB, Langhorst BW, Bilotti K, Pryor JM, Cantor EJ, Canton B, Knight TF, Evans TC Jr, Lohman GJS (2018). Comprehensive profiling of four base overhang ligation fidelity by T4 DNA ligase and application to DNA assembly. ACS Synthetic Biology, 7(11), 2665–2674.

  • Eremin, I. I. et al. 2Approximation algorithm for finding a clique with minimum weight of vertices and edges (2014). Proc. Steklov Inst. Math., 284, 87–95. DOI:10.1134/S0081543814040101

  • Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications (1970). Biometrika 57, 97–109.

  • Alberts B., Johnson A., Lewis J., et al. Molecular Biology of the Cell (2014). Garland Science. 6th edition.

  • Pagès H, Aboyoun P, Gentleman R, DebRoy S (2023). Biostrings: Efficient manipulation of biological strings. Bioconductor. doi:10.18129/B9.bioc.Biostrings, R package version 2.68.1, Biostrings.

  • Ogundepo E (2023). forstringr: String Manipulation Package for Those Familiar with ‘Microsoft Excel’. R package version 1.0.0, CRAN.

  • Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. doi:10.1007/978-3-319-24277-4, R package version 3.3.5, ggplot2.

  • Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J (2021). shiny: Web Application Framework for R. R package version 1.7.1, shiny.

  • Chang W, Ribeiro BB (2021). shinythemes: Themes for Shiny. R package version 1.2.0, shinythemes.

  • Wickham H, Seidel D (2020). scales: Scale Functions for Visualization. R package version 1.1.1, scales.

  • R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. RProject.

Acknowledgements

This package was developed as part of an assessment for 2023 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, CANADA.

TRexDAD welcomes issues, enhancement requests, and other contributions. To submit an issue, use the GitHub issues. Many thanks to those who provided feedback to improve this package.

Example Workflow Usage

The main objective of the package is to find the optimal assembly design for mutagenizing a gene of interest. For illustration purposes, it is set to Rad27 by default; but users are encouraged to use this tool for any genes of interest with an ORF which is at least 60 condons in length.

To initiate workflow with the default gene of interest, run:

execute_and_plot()

For comprehensive explanation of workflow, please see the tool’s vignettes.

Package Structure

The package tree structure is provided below:

📦TRexDAD
 ┣ 📂R
 ┃ ┣ 📜pipeline.R
 ┃ ┗ 📜run_TRexDAD.R
 ┣ 📂inst
 ┃ ┣ 📂extdata
 ┃ ┃ ┣ 📜TRex_overview.png
 ┃ ┃ ┣ 📜overhang_fidelity.csv
 ┃ ┃ ┗ 📜pipeline_workflow.png
 ┃ ┣ 📂shiny_script
 ┃ ┃ ┗ 📜app.R
 ┃ ┗ 📜CITATION
 ┣ 📂man
 ┃ ┣ 📂figures
 ┃ ┃ ┗ 📜README-unnamed-chunk-4-1.png
 ┃ ┣ 📜calculate_global_score.Rd
 ┃ ┣ 📜calculate_local_score.Rd
 ┃ ┣ 📜calculate_optimal_tiles.Rd
 ┃ ┣ 📜execute_and_plot.Rd
 ┃ ┣ 📜get_all_overhangs.Rd
 ┃ ┣ 📜get_overhangs.Rd
 ┃ ┣ 📜oligo_cost.Rd
 ┃ ┣ 📜optimize_position.Rd
 ┃ ┣ 📜pick_position.Rd
 ┃ ┣ 📜run_TRexDAD.Rd
 ┃ ┗ 📜split_into_codons.Rd
 ┣ 📂tests
 ┃ ┣ 📂testthat
 ┃ ┃ ┗ 📜test_pipeline.R
 ┃ ┗ 📜testthat.R
 ┣ 📂vignettes
 ┃ ┣ 📜.gitignore
 ┃ ┣ 📜tutorial_TRexDAD.R
 ┃ ┣ 📜tutorial_TRexDAD.Rmd
 ┃ ┗ 📜tutorial_TRexDAD.html
 ┣ 📜.RData
 ┣ 📜.Rbuildignore
 ┣ 📜.Rhistory
 ┣ 📜.gitignore
 ┣ 📜DESCRIPTION
 ┣ 📜LICENSE
 ┣ 📜NAMESPACE
 ┣ 📜README.Rmd
 ┣ 📜README.md
 ┗ 📜TRexDAD.Rproj

About

A Pipeline for the Design of Tile Region Exchange Mutagenesis (T-Rex) Harnessing Data-optimized Assembly Design (DAD)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages