A step-by-step example on the use of the DBTCShiny package.
This repository contains the files and instructions to use the DBTCShiny package rgyoung6/DBTCShiny. The DBTC functions have four main outcomes...
- Fastq file processing using Dada in R
- Using the Basic Local Alignment Search Tool (BLAST), amplicon sequence variants (ASVs) can be searched against local NCBI or custom sequence databases
- Assign taxa to the unique reads using the NCBI taxon database through taxa names and/or taxaIDs
- Condense the resulting (ASV) taxonomic assignment tables to unique taxa with the ability to combine datasets (using different sequence databases for the same reads, or results from the same samples for different molecular regions) into a combined results table
Before using these files and working through this tutorial please install the DBTCShiny package.
- Data
- Initial Considerations
- Getting Started
- Dada Implement
- Combine Dada Output
- Make a BLAST Database
- BLAST Sequences
- Taxon Assign
- Combine Taxon Assign
- Reduce Taxa Assign
- Combine Reduced Taxon Assign
- Mapping Dashboard
- References
The data for the tutorial comes from the publication Young et al. 2021 which passively collected insects using Lindgren funnel traps with a high salt solution to survey for invasive insects. The insects were removed from the traps and the salt solution was retained and used to extract environmental DNA (eDNA). The data used in the examples in this tutorial are a subset of the data obtained.
To download the data for the tutorial go to the main DBTCShinyTutorial GitHub repository page and select the green button labelled 'Code' and then select 'Download ZIP' (see image below). This will save all of the tutorial files you need to your local computer (please note that this is a fairly large file, approximately 526 MB). Once downloaded, extract the contents of the compressed file and place the 'DBTCShinyTutorial-main' folder in an accessible area of your computer, ideally near the root.
Also, please make sure that you have permissions for this location so that the files will be accessible for reading, the folders for writing data, and that the blastn and makeblastdb files are able to be executed (if you are using these program files instead of loading the programs as trusted software into your system). For Windows, running these programs is not often a problem. For Linux and Mac OS, open a terminal window and navigate to the directory containing the 'DBTCShinyTutorial-main' folder you have downloaded and extracted. Once there, use the following command to change the permissions on the BLAST program files (Note: you may need appropriate user access to change the file permissions).
chmod -R 0777 DBTCShinyTutorial-main
For Mac OS another issue may arise where the BLAST software is not recognized as a trusted program.
To allow the computer to run makeblastdb as a trusted program you will need to navigate to the 'DBTCShinyTutorial-main' in a finder window. Once there go to the file of choice (makeblastdb or blastn) and do the following...
Right-Click the Application: Instead of double-clicking to open the file, right-click on the program file.
Choose Open: From the context menu that appears after right-clicking, choose "Open." This action should present you with an option to open the file even though it's from an unidentified developer.
Open Anyway: After selecting "Open," macOS will display a warning dialog stating that the application cannot be opened because it's from an unidentified developer. In this dialog, there should be an "Open Anyway" button. Click on this button to bypass the warning and open the program file.
Confirm Your Choice: You may need to enter your administrator password or confirm your choice before macOS allows the file to be opened. Follow the on-screen prompts to complete this step.
Once complete, the program should be usable (Note: the process to indicate that this is a trusted program may need to be completed with every Mac OS session). Rerun DBTCShiny and attempt your analysis again.
When working with the DBTCShiny application there are some limitations. The functions in the package make use of several external resources and programs, and these programs do not work well when there is white space in the names of files and directories. As such, it is recommended that when using DBTCShiny, working files and folders be kept as close to the root directory as possible (of the computer or external hard drive).
Also, when using DBTC functions, naming conventions need to be carefully considered. Special characters should be avoided (including the question mark, number sign, and exclamation mark). It is recommended that dashes be used for separations in naming conventions, while retaining underscores for use as information delimiters (this is how DBTC functions use underscores).
There are several key character strings used in the DBTC pipeline, the presence of these strings in file or folder names will cause errors when running DBTC functions. The following strings are those used in DBTC and should not be used in file or folder naming:
- _BLAST
- _taxaAssign
- _taxaCombined
- _taxaReduced
DBTCShiny can be run through a terminal window, the R console, or through Posit.
All descriptions and comments used in the tutorial below relate to the DBTCShiny application and the graphical user interface to implement the functions.
To begin, ensure that the DBTCShiny package is installed in your instance of R and that you have loaded the package through the library('DBTCShiny') command (see the DBTCShiny GitHub repository).
Once all dependencies and the DBTCShiny package are installed and loaded, start the program using the following command...
launchDBTCShiny()
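For reference, the complete sequence from installation to launch might look like the following (a minimal sketch, assuming installation from GitHub via devtools as described in the DBTCShiny GitHub repository):

```r
# Install devtools if needed, then DBTCShiny and its dependencies from GitHub
install.packages("devtools")
devtools::install_github("rgyoung6/DBTCShiny")

library(DBTCShiny)   # load the package
launchDBTCShiny()    # launch the Shiny application in the default browser
```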
Once initiated, a window will appear in your computer's default web browser with the title screen for DBTCShiny (see below).
There are several expandable panels on the initial page that can be explored for more information on installation, dependencies, and contact for troubleshooting.
The side panel (left side of the screen) of DBTCShiny contains two options in addition to the welcome option. The first option, DBTC Tools, contains all of the main high-throughput processing functions of DBTCShiny and this section is where we will start. Once DBTC Tools is clicked it will show a page with the eight tabs for the main functions of DBTCShiny (see image below).
Shiny applications use buttons to select the files necessary when running analyses. These buttons will bring up a dialog window (referred to as an 'Open' dialog in Mac OS, an 'Open File' dialog in Windows, and a 'File Picker' dialog in Linux, and referred to throughout this tutorial as a 'select file dialog window').
Finally, before moving on, be sure that you have the tutorial data downloaded and stored on your local machine (see the Data section above if unsure).
DBTCShiny uses dada2 to complete the analysis of raw fastq files generated from high-throughput sequencing runs (A run is a group of results processed at the same time on the same machine representing the same molecular methods). While this tutorial will cover some of the settings when implementing dada2, for more details on the specifics of available settings please read through the dada2 documentation. There are five sections within the DBTCShiny Dada submission. We will reference each section in turn below with respect to running our example analyses.
This section provides file locations, processing, and saving options for the Dada analysis of Fastq files.
Once clicked, this button will open a select file dialog window to select the data files to process.
The 'Fastq File' selection will need to be in the appropriate file structure (see Dada Implement Input in the DBTCShiny ReadMe). Move to the location of your downloaded and extracted tutorial files.
Once there we will make use of the A-Dada folder and choose one of the two options, either Bidirectional or Unidirectional (remember this selection as it will matter in a couple of steps). For the purposes of this tutorial we will use the Bidirectional data folder.
Then select either one of the runs (here we will use the first Run folder but either could be used).
Finally, we will select one of the fastq files. Note: these files are compressed, but are not paired within a compressed folder; the files are paired through their naming convention (see Forward and Reverse Identifiers).
The second button is the Primer File button and this option will be utilized for this tutorial. Again, once selected it will bring up a select file dialog window (Note: if this button is not utilized, or the selection is cancelled, the dada_implement() function will run without a primer file and therefore without primer matching and trimming).
Navigate to the same A-Dada folder. In this folder there are two primer files (SaltTrapPrimers-Bidirectional.tsv and SaltTrapPrimers-Unidirectional.tsv). Select the bidirectional primer file for this tutorial (please note if unidirectional was selected above you will need the unidirectional primer file).
The format of this file can be viewed by opening it, or you can see the description in the DBTCShiny ReadMe Dada Implement Format of the Primer File section. The data in this file will be used with the R ShortRead package and the trimLRPatterns() function to remove primers by pattern matching from the ends of the sequences. The contents of this file should only include AGCT nucleotides and no other IUPAC codes. If there is degeneracy in the primers, represented by IUPAC codes, then these primers need to be represented in the table by multiple primer options containing only AGCT nucleotides. Note: with your own data, if you are getting poor quality matches to database sequences it could be due to the presence of primers and other indices or tags which were not properly removed from the sequence data, and the contents of this file should be considered in these cases.
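To illustrate the degeneracy point above, here is a minimal sketch of how a degenerate primer could be expanded into the AGCT-only variants required in the primer file. The expand_degenerate() helper and the example primer are illustrative only and not part of DBTCShiny:

```r
library(Biostrings)

# Expand a degenerate primer (IUPAC codes) into all AGCT-only variants,
# each of which would become a primer entry in the primer file.
expand_degenerate <- function(primer) {
  # IUPAC_CODE_MAP maps each code to its possible bases, e.g. "W" -> "AT"
  bases <- strsplit(IUPAC_CODE_MAP[strsplit(primer, "")[[1]]], "")
  # Enumerate every combination of the possible bases at each position
  apply(expand.grid(bases, stringsAsFactors = FALSE), 1, paste0, collapse = "")
}

expand_degenerate("GGWACWGGWTGAACWGT")  # four W positions -> 16 AGCT-only primers
```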
There are three options possible to indicate the directional data present in your samples.
Bidirectional - If selected the data will be processed using both forward and reverse reads. The Dada analysis will merge these reads and the data coming out of the analysis will be representative of both the forward and reverse reads. For the purposes of this tutorial we will process the data only bidirectionally.
Unidirectional - If selected the data will be processed only in the forward direction. The outcome of the analysis will not be from paired and merged reads but will only represent data from a single direction. Note: if this is selected, the Forward identifier and the Reverse identifier sections will disappear from the DBTCShiny interface as these are not needed to identify the forward and reverse elements of paired reads when only processing in one direction (see image below).
Both - If selected there will be three parallel runs of the data: one merged using both forward and reverse reads, a second analysis using forward reads, and a third analysis using reverse reads (descriptors to indicate the type of analysis will be present on output files and within summary data files). Note: processing using both will significantly increase the time to analysis; however, the independent forward and reverse analyses may be helpful to assess if there was poor amplification or quality in one of the directions. This could possibly help to identify reasons for poor merged data or help to better understand primer issues.
There are two fillable fields, one for forward and one for reverse. These fields indicate the patterns in the naming convention of the files that will identify if the data file contains forward or reverse data.
The default contents of these fields are patterns often used in MiSeq generated files. If you have other naming conventions you can type them in here now. For this tutorial we will use the default patterns.
This binary selection will, if yes is selected, output files in .pdf format with plots of the quality for the reads being processed.
These results will be generated for the initial fastq files and for the cleaned and trimmed fastq files. The data from the quality analyses are also reported in the final reporting table, but these files allow you to visualize the data. The default for this selection is to produce these files; for the purposes of this tutorial we will select no and not produce these files.
This section has a single field that accepts numeric input. This value is used by the ShortRead trimLRPatterns() pattern matching function when pattern matching to trim primers and other artificial nucleotide sequence data at the end of reads.
The default for this field is 2, which will allow up to two mismatched nucleotides between primer sequences and experimental reads. If there are longer primer regions then increasing this value could be considered. For example, with an average primer region of 20 nucleotides the default of 2 would be appropriate; if the primer region were 30 nucleotides long, a value of 3 could be used here. For the purposes of this tutorial the default value of 2 will be used.
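As a rough illustration of how this mismatch allowance behaves, here is a minimal sketch of the underlying Biostrings/ShortRead call (DBTCShiny performs this internally; the reads and primer below are made-up examples):

```r
library(Biostrings)

reads  <- DNAStringSet(c("GGTACAGGTTGAACTGTATACCC",   # exact primer match
                         "GGTACAGGATGAACTGTATACCC"))  # one mismatch in the primer
primer <- "GGTACAGGTTGAACTGT"

# Allow up to 2 mismatches when the full primer aligns at the 5' end of a read
trimLRPatterns(Lpattern = primer, subject = reads, max.Lmismatch = 2)
# Both reads are trimmed to "ATACCC"; with max.Lmismatch = 0 only the first would be
```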
This section contains fields used by dada2 when end trimming and quality filtering.
The first two fields accept values for the dada2 end trimming function (dada2 parameter trimLeft). This function trims from the left and each of the two values here will trim the forward or reverse read respectively (For more information see here). When pattern trimming this function is not utilized. The default for these values is 0 (and therefore this function is not used) and these are the values that will be used in this tutorial.
The third field accepts a value for the dada2 maximum number of expected errors (the dada2 maxEE parameter; for more information see here). For the purposes of this tutorial we will use the default value of 2.
The next fillable fields provide data on quality trimming and length trimming.
The truncation value field (dada2 parameter truncQ; for more information see here) sets the quality score at which sequences are truncated. The default for this value is 2 and this is what will be used for this tutorial.
The final two fillable fields in this section are the truncLenValueF and the truncLenValueR values. These values trim the nucleotide sequences based on the overall expected length of the sequences (for more information see here). When pattern trimming, this function is not used and so these values are set to 0 by default; we will use 0 for the tutorial.
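Taken together, the values in this section correspond to the standard dada2 filtering parameters. A minimal sketch of the equivalent dada2 call with the tutorial values (DBTCShiny sets these internally; the file paths are placeholders):

```r
library(dada2)

filterAndTrim(fwd      = "Run1/sample_R1_001.fastq.gz",       # placeholder paths
              filt     = "Run1/filt/sample_R1_001.fastq.gz",
              rev      = "Run1/sample_R2_001.fastq.gz",
              filt.rev = "Run1/filt/sample_R2_001.fastq.gz",
              trimLeft = c(0, 0),  # no fixed end trimming (pattern trimming is used)
              maxEE    = 2,        # maximum expected errors per read
              truncQ   = 2,        # truncate at the first base with quality <= 2
              truncLen = 0)        # no fixed-length truncation
```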
This section provides information on the error assessment and quality filtering of sequences using the dada2 learnErrors() function.
The first fillable field in the Dada learnErrors section indicates the percent of data to use when assessing errors. The dada2 analysis generally evaluates the total number of reads present in a submitted run (a run is a group of results processed at the same time on the same machine representing the same molecular methods) and then takes a percentage of the sequences to estimate the error present. DBTCShiny uses the same learnErrors() function, but instead forces the function to use a subset of the data selected as whole fastq files. In doing this we can report the files used for the error estimation, making the process reproducible. The first fillable section here identifies the percentage of fastq files used for the analysis. The default is 0.1 (10%), but in cases where there are very few results files and this percentage would give three or fewer files, the minimum number of files is automatically set to three. For our tutorial we will use the default value of 0.1.
The second fillable field is for the upper limit on the total number of nucleotides used for the error assessment. This value is set very high to allow the selection of files for the error assessment; lowering this value may cause the error assessment to use only a subset of the selected files. The default value for this field is 1,000,000,000 and this is the value we will use for this tutorial.
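A minimal sketch of this file-level subsetting and the learnErrors() call with the tutorial values (DBTCShiny performs this internally; the file paths are placeholders):

```r
library(dada2)

filtFs <- sort(list.files("Run1/filt", pattern = "_R1", full.names = TRUE))
nFiles <- min(length(filtFs), max(3, ceiling(0.1 * length(filtFs))))  # 10%, minimum 3
errF   <- learnErrors(filtFs[seq_len(nFiles)],  # whole files, chosen reproducibly
                      nbases = 1e9,             # upper nucleotide limit from above
                      multithread = TRUE)
```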
This section includes fields necessary for the merging of pairs. Please note that some elements of this section will disappear if the unidirectional data analysis is selected (in the General Information section).
The first fillable field in the Dada mergePairs section provides a value for the maximum number of mismatches allowed in the merge analysis. This value represents the number of acceptable mismatches in the overlapping sections between merged forward and reverse reads. The smaller this number, the more constrained the analysis and the more potential merged pairs will be rejected. As the overlapping region increases, a corresponding increase in the acceptable number of mismatches may result in more reads at the end of the analysis. A lower value will also provide higher confidence in the resulting reads. The default value is set to 2 and this is the value that will be used in this tutorial.
The second field in this section is the minimum total number of overlapping nucleotides required to merge forward and reverse pairs. Where the overlap between reads is larger this value can also be increased; increasing this value increases the stringency of the merge. The default value is set to 12 and this is the value that will be used for this tutorial.
The third field is the trim merged reads field. If set to TRUE, this field will trim the merged reads so that they only include positions covered by two nucleotides, one on the forward and one on the reverse read. If set to FALSE, the merged pairs will include longer tails on the ends of the merged sequences in both the forward and reverse directions. The default value is FALSE and will be used for the tutorial.
The final field provides an input value for the minimum total length of the reads coming out of the analysis. This value is set to a default of 100 and can be adjusted depending on the expected length of the reads based on the molecular primers. The tutorial will use the default value of 100.
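The first three values in this section correspond to arguments of the dada2 mergePairs() function, with the minimum final length applied as a separate filtering step afterwards. A minimal sketch with the tutorial values (dadaFs/dadaRs and derepFs/derepRs are placeholders for objects produced by the earlier denoising steps):

```r
library(dada2)

merged <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
                     maxMismatch  = 2,      # mismatches allowed in the overlap
                     minOverlap   = 12,     # minimum overlapping nucleotides
                     trimOverhang = FALSE)  # keep the tails beyond the overlap
```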
At the bottom of this section there is a button labeled 'Dada Submit'. Clicking this button will initiate the running of the dada_implement() function.
The output from the dada_implement() function can include up to four file folders in each of the Run folders submitted. In our example we are missing the output folder 'A_Qual' and 'C_FiltQual' as we selected FALSE for the print quality plots option. If this was set to TRUE then there would be two additional folders with plots displaying the quality metrics of each of the raw (in folder 'A_Qual') and filtered and trimmed (in folder 'C_FiltQual') results.
The 'B_Filt' folder contains the filtered files and the trimmed results (in the 'Primer_Trim' subfolder) as compressed fastq files. These are provided as they represent the raw output from the analysis and could be evaluated using other programs outside of the DBTCShiny pipeline.
There are five main types of files as output from the dada_implement():
dadaSummary
A text-based summary of all of the settings used to process the fastq files in this instance of dada_implement().
dadaSummaryTable
The dadaSummaryTable contains descriptive data about all of the fastq files processed from the run where the output files are located.
Dada Summary Table headers
- inputReads
- fwdInputQualMin
- fwdInputQualMedian
- fwdInputQualMean
- fwdInputQualMax
- revInputQualMin
- revInputQualMedian
- revInputQualMean
- revInputQualMax
- filteredReads
- fwdDenoisedReads
- fwdDenoisedAvgQual
- fwdDenoisedAvgLen
- fwdDenoisedUniqueReads
- revDenoisedReads
- revDenoisedAvgQual
- revDenoisedAvgLen
- revDenoisedUniqueReads
- mergedReads
- mergedReadsAvgLen
- noChimReads
- noChimTrimReadsOverMinLen
- noChimTrimUniqueReadsOverMinLen
- noChimTrimAvgLenOverMinLen
- finalCleanedPerReads
Error Assessment Visualizations
The Dada analysis will assess the fastq files for likely instances of nucleotide errors. These data are represented visually in the 'ErrorForward' and 'ErrorReverse' pdf files. The interpretation of these data are well covered here.
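These plots appear to correspond to dada2's standard error-model visualization; a minimal sketch for reproducing them interactively (errF being the error model returned by learnErrors()):

```r
library(dada2)

plotErrors(errF, nominalQ = TRUE)  # observed vs. fitted error rates by quality score
```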
Run ASV Tables and Paired Fasta Files
The main output from the dada analysis is the ASV files and their associated fasta files. There are three potential paired ASV-fasta files: Merged, Forward, and Reverse. As we only selected to run the Merged analysis, only the merged files are present for this tutorial example.
The ASV files are tables which, at their most basic, include the sequence reads obtained through the analysis and the associated number of those reads for each sample analysed. For the DBTCShiny output, there are several other data that are included in the ASV output tables.
- UniqueID - A unique identifier assigned to the read
- Length - the length of the read in number of nucleotides
- Results - The analysis where the results were obtained (this tutorial example will indicate Merged).
TotalTable
The total table provides the same format as the paired ASV and Fasta main output files. However, all of the data are combined and retained in this output file. Results in this file could include Merged, Forward, Reverse, and ChimRemoved qualifiers in the Results column. The Merged, Forward, and Reverse indicate from which analysis the reads were obtained. The ChimRemoved identifier indicates that the dada analysis did not place the associated read in the paired ASV-fasta files as they were assessed as a chimera sequencing error.
Most of the files generated by the dada_implement() function exist to help researchers evaluate the quality of the samples and the efficacy of how they were analysed. In addition, the output files provide all of the information necessary for reproducibility. Once these two elements are satisfied, the only files needed for further analysis using the DBTCShiny analysis pipeline are the paired ASV-fasta files. The downloaded examples provided for this tutorial have already moved the Merged ASV-fasta files to the appropriate location for the next steps. However, if you are using your own data it is best to take the paired ASV-fasta files and place them in a new folder to continue your analysis.
The output from the dada_implement() function includes the paired ASV-fasta data files. Where more than one sequence run has been completed for a given project, it is often helpful or necessary to combine the data from the different runs into a single data file for further analysis (Note: the files being combined should be from the same molecular protocols). Only the ASV files are required for this combine output function. For the purposes of this tutorial we are using the 'Run1_Merge.tsv' and the 'Run2_Merge.tsv' files from the dada_implement() output. To select these files simply click the 'Select a File in the Target Folder' button and a select file dialog window will open. Navigate to the B-CombineDada folder and select one of the files in this folder; the function will process all '_Merge.tsv', '_MergeFwdRev.tsv', and '_Forward.tsv' files into paired ASV-fasta files.
The combine_dada_output() function produces three output files: the paired ASV and fasta files, and a summary file containing information on the running of the function (see the image below of the output files for this tutorial).
There are no data files from previous DBTCShiny elements that lead into the make_BLAST_DB() function. Instead, an externally created fasta file is necessary to establish a custom BLASTable database. This file can be manually created or can be more effectively and efficiently created through the use of the MACER package. In either instance the final format must be in the MACER format (see below).
- The MACER fasta header format -
>UniqueID|OtherInformation|Genus|species|OtherInformation|Marker
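For example, a hypothetical record following this format (the accession, field values, and sequence are illustrative only) would look like:

```
>AB123456|BOLD|Agrilus|planipennis|Coleoptera|COI-5P
ACTGACTGACTG... (nucleotide sequence)
```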
The make_BLAST_DB() shiny page has five elements. The first element is the 'Fasta File' selection button. Once clicked this button will again bring up a select file dialog window.
Navigate to the tutorial folder 'C-MakeDB'. For this tutorial there is a single fasta file example (2024_01_05_COI-5P_393_Species_of_Concern.fas) and three folders each containing the program file for the BLAST+ makeblastdb program for different platforms. For this tutorial example select the 2024_01_05_COI-5P_393_Species_of_Concern.fas file.
Note: The species of concern sequences fasta file contains sequence data for the COI-5P molecular region for all insect taxa on the CFIA list of invasive and regulated pests. These data were obtained from the Barcode of Life Datasystems and NCBI GenBank databases using MACER.
The next element is the 'makeblastdb Location' button. If this button is not selected, or if the file selection is cancelled, the program will still run and will assume that the BLAST+ makeblastdb program is installed and accessible from all folders on your computer (i.e., the program has been added to the computer's path). If the program is not in the computer's path then the appropriate makeblastdb program file needs to be selected for the computer's operating system (see image of files below).
For ease, this tutorial includes a version of the program for three platforms (NOTE: these programs are not updated regularly and may not be the most recent versions). To utilize these program files, select them in the select file dialog window, but remember to ensure that the program files have execute permissions (see above).
The 'Select the NCBI Taxon Database File' button for this function will open a third select file dialog window where you will need to select the 'accessionTaxa.sql' data file (See the accessionTaxa instructions).
There are then two fillable fields required for the final elements of the make_BLAST_DB() function.
The first fillable field is the minimum length of the sequence (in nucleotides) for records that will be included in the constructed database. For this tutorial the default of 100 is used.
The second fillable field accepts alphanumeric values. This field should be filled with a short (2 to 10 character) string used as a unique identifier for the database you are going to construct. As this tutorial doesn't rely on the previous steps, any descriptive name can be used here; 'SOCTUT' would work well for this tutorial.
Finally, clicking the 'Create BLAST Database Submit' button will initiate the program.
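For reference, a minimal sketch of the underlying BLAST+ call that make_BLAST_DB() wraps (assuming makeblastdb is on the system path; DBTCShiny's exact arguments, including the minimum sequence length filter applied to the fasta file beforehand, may differ):

```r
# Build a nucleotide BLAST database from the species-of-concern fasta file
system2("makeblastdb",
        args = c("-in", "2024_01_05_COI-5P_393_Species_of_Concern.fas",
                 "-dbtype", "nucl",   # nucleotide database
                 "-out", "SOCTUT"))   # the unique identifier chosen above
```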
The seq_BLAST function has three buttons and three fillable fields.
NOTE: While the DBTCShiny package has been built for the analysis of high-throughput sequencing results, the BLAST, taxonomic assignment, taxonomic condense, and mapping functions can be utilized with single specimen Sanger sequencing data.
The 'Database File' button will launch a select file dialog window. Using this window, the user is required to select a file inside a constructed BLAST formatted database (see image below of files contained in a BLAST formatted database). Any one of the files in the BLAST formatted database folder can be selected.
The next element is the 'BLASTn Location' button. If this button is not selected, or if the file selection is cancelled, the program will still run and will assume that the BLAST+ blastn program is installed and accessible from all folders on your computer (i.e., the program has been added to the computer's path). If the program is not in the computer's path then the appropriate blastn program file needs to be selected for the computer's operating system (see image of files below).
The final button is the 'Query Fasta File' button. This will again bring up a select file dialog window where the user will need to select a Fasta file of interest to BLAST against the selected database. For this tutorial the user should select the 2024_02_14_0409_combineDada.fas file. (See image below).
NOTE: In this tutorial example there is a single Fasta file in the 'D-BLAST' folder. However, the seq_BLAST function will process all Fasta files in the selected location, BLASTing them against the indicated BLAST database. Also, if a fasta file has paired ASV data, these will be recombined with the BLAST results using the taxon_assign() function.
There are three fillable fields with arguments for this function (see image below). The first fillable field accepts a numerical input. This value indicates the maximum number of returned results that will be saved to the seq_BLAST output file. In cases where there is a small BLAST formatted database the results will not reach the maximum indicated here. However, it is more than likely that sequences queried against very large databases, such as the NCBI GenBank nucleotide database, will all have the maximum number of returned results. For this tutorial we will use the default of 200, as our BLAST database is small with known clean data records and this value will easily saturate the results with matches to high quality data.
The second fillable field also accepts a numeric value, for the minimum length of sequences in the submitted Fasta file that will be BLASTed against the indicated database (Note: the smaller the length, the more computationally demanding the BLAST). This value is usually set to a known value close to the expected fragment size of the data submitted to the seq_BLAST function. For this tutorial we will use the default value of 100.
Finally, the third fillable field also accepts a numeric value, indicating the total number of computer cores to utilize when running the seq_BLAST function. This value can be utilized on Linux and Mac OS, but will automatically be set to 1 when running the function in a Windows environment. If using multiple cores in your analysis please ensure you do not utilize all available cores, as there is a need to retain computational capacity for the operating system and R. This tutorial will use the default of 1, as the dataset and database are very small and so seq_BLAST will run quickly on most recently built computer systems. However, running this function more than once to test different core counts, where the operating system allows, is encouraged.
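For reference, a minimal sketch of the underlying BLAST+ call that seq_BLAST() wraps (assuming blastn is on the system path; the output file name is hypothetical and the tabular -outfmt 6 shown here may differ from the format DBTCShiny uses internally):

```r
# BLAST the combined dada fasta file against the tutorial database
system2("blastn",
        args = c("-db", "SOCTUT",                            # database built earlier
                 "-query", "2024_02_14_0409_combineDada.fas",
                 "-max_target_seqs", "200",                   # maximum returned results
                 "-num_threads", "1",                         # computer cores
                 "-outfmt", "6"),                             # tabular output
        stdout = "combineDada_BLAST.tsv")                     # hypothetical file name
```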
The taxon_assign() function has two button select file dialog window elements, six fillable fields, and one TRUE/FALSE selection.
The taxon_assign() function processes BLAST results with paired fasta files. If paired ASV-fasta files are available in the same location as the '_BLAST' and Fasta files, these ASV data will be combined with the taxonomic assignment results from the BLAST and fasta files.
The first button is the 'Select a file in the location of BLAST and Fasta file' button. This will again bring up a select file dialog window where the user will need to select a file in the location where the BLAST and paired Fasta files are located. For this tutorial the user should select any of the files in the E-TaxaAssign folder (see image below).
NOTE: In this tutorial example there are two BLAST results files, representing the same data but generated using two different databases, paired with a single Fasta file in the 'E-TaxaAssign' folder. The taxon_assign() function will process all BLAST files present if there are paired Fasta files in the selected location (and will combine the BLAST results with the paired ASV-fasta files).
The 'Select the NCBI Taxon Database File' button for this function will open a select file dialog window where you will need to select the 'accessionTaxa.sql' data file (see the accessionTaxa instructions).
The first fillable field accepts a numeric value indicating the total number of computer cores to utilize when running the taxon_assign() function. This value can be utilized on Linux and Mac OS, but will automatically be set to 1 when running the function in a Windows environment. If using multiple cores in your analysis please ensure you do not utilize all available cores, as there is a need to retain computational capacity for the operating system and R. This tutorial will use the default of 1, as the dataset and database are very small and so taxon_assign() will run quickly on most recently built computer systems. However, running this function more than once to test different core counts, where the operating system allows, is encouraged.
There are then five fillable fields that when changed can alter the results in the taxonomic assignment output files. The first two of these fields contain the nucleotide sequence coverage and identity threshold values. These values represent percentages out of 100 and are given in whole numbers for ease of reporting in the output files. The coverage and identity (ident) values are used to filter the BLAST results, where only query-to-database records with values higher than those assigned will be used to assess the 'Lowest_Single_Rank_Above_Thres' and 'Lowest_Single_Taxa_Above_Thres' values in the output ASV table (see below for an image of the output table structure and see here for explanations of the data columns).
The next fillable field accepts a numeric input with values between 0 and 1. The 'propThres' value looks at the taxonomic rank directly below the assessed rank indicated in the 'Final_Rank' and 'Final_Taxa' columns. If the BLAST results at the lower taxonomic level had more than the 'propThres' proportion of results in agreement with the 'Final_Taxa' at the 'Final_Rank', but not all were in agreement, then the record in question would be flagged so that the researcher can evaluate the records making up the results at the rank below the assigned rank. This flag would appear in the 'Result_Code' section and would show 'TBAT(0.95)'.
In the data for this tutorial there was an example of this which is highlighted below...
| genus | species |
| --- | --- |
| Mordellaria(25,96,97.19,0) | Mordellaria borealis(1,96,99.75,0), Mordellaria serval(24,96,97.19,0) |
In this example from the ST BLAST formatted database the tutorial results will show a TBAT(0.95) flag present in the 'Result_Code' column in the taxon_assign() '_taxaAssign' output file. This flag would be present as the taxonomic assignment of the genus Mordellaria, while accurate, was based on results at the species level with two species assignments. However, there was only a single record indicating Mordellaria borealis representing 4% of the returned results meaning that Mordellaria serval represented over the set 'propThres' of 95% which resulted in the flag to inform the researcher that this could be a result needing further consideration. For the purposes of this tutorial we will use the default value of 0.95.
The next two fillable fields, 'coverReportThresh' and 'identReportThresh', accept numeric input between 1 and 100. These fields look at the final assessed taxonomic result and place flags into the 'Result_Code' column if the final coverage and identity values fall below the 'coverReportThresh' and the 'identReportThresh' respectively. The flags populating the 'Result_Code' column are BIRT(identReportThresh), indicating that the final taxa result is below the identity reporting threshold, and BCRT(coverReportThresh), where the final taxa result is below the nucleotide coverage reporting threshold. For this tutorial we will use the default values of 95 for both the 'coverReportThresh' and the 'identReportThresh'.
There are instances where the BLAST does not yield results for a submitted nucleotide sequence (this is a function of the BLAST algorithm and its default reporting threshold of an e-value of 10; see here for details). The final 'includeAllDada' TRUE/FALSE selection will, if set to TRUE, populate the final table with records that failed to obtain a BLAST match to the database and, instead of a taxonomic hierarchy, will place NA in the rank sections. If this is set to FALSE then only records that had at least a 'superkingdom' assignment will be included. For the purposes of this tutorial we will use the TRUE selection and include all records in the output, with or without BLAST matches. However, it is recommended that the user experiment with both options.
The final 'Taxon Assign Submit' button will then initiate the taxon_assign() function.
The combine_assign_output() function takes multiple outputs from the taxon_assign() function representing the same data but with differing taxonomic assignment files that were generated using different BLAST databases and combines them into a single file.
The 'Taxa Assign File' button opens a select file dialog window where the user will navigate to the 'F-CombineTaxaAssign' folder and select one of the '_taxaAssign' files in this location. The function will combine all '_taxaAssign' files in this location into a combined file. The user will need to ensure that all of the files in this location share the same root name and underlying data, differing only in the BLAST database used. If there are '_taxaAssign' files for multiple root names (meaning they stem from different initial datasets), this function will attempt to combine them, which is NOT a desired outcome.
The only fillable field accepts a numeric value indicating the total number of computer cores to utilize when running the combine_assign_output() function. This value can be utilized on Linux and Mac OS, but will automatically be set to 1 when running the function in a Windows environment. If using multiple cores in your analysis please ensure you do not utilize all available cores, as there is a need to retain computational capacity for the operating system and R. This tutorial will use the default of 1, as the dataset and database are very small and so the function will run quickly on most recently built computer systems. However, running this function more than once to test different core counts, where the operating system allows, is encouraged.
Finally, click the 'Combine Taxa Assign' button to start the function. The output from the combine_assign_output() function is a single '_taxaAssignCombine' (ASV) file and a text file with the details of the running of the function. Note: there can be only one set of '_taxaAssign' files in the selected location, meaning all files stem from the same root files and have the same root name.
The reduce_taxa() function will take '_taxaAssignCombine' or '_taxaAssign' files and reduce the taxonomic assignments so that there is a single unique taxa listed. This function works per file and produces a sister '_taxaReduced' file to every '_taxaAssignCombine' or '_taxaAssign' file present in the selected directory. This is essentially taking the (ASV) table and creating a taxonomic table.
There are two fields for this function. The first is the 'Taxa Assign File Location' button that opens a select file dialog window. Navigate to the location with your files of interest; in the case of this tutorial, select the 'G-ReduceTaxa' folder and select any file at this location.
The second field accepts a numeric value indicating the total number of computer cores to utilize when running the reduce_taxa() function. This value can be utilized on Linux and Mac OS, but will automatically be set to 1 when running the function in a Windows environment. If using multiple cores in your analysis please ensure you do not utilize all available cores, as there is a need to retain computational capacity for the operating system and R. This tutorial will use the default of 1, as the dataset and database are very small and so the function will run quickly on most recently built computer systems. However, running this function more than once to test different core counts, where the operating system allows, is encouraged.
The final central DBTCShiny function is the combine_reduced_output() function. This function requires two pieces of data for implementation. The first is the location of the '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files that the user would like to combine. These files can represent the same samples processed with different molecular protocols and representing different molecular marker data. To select these files simply click the 'Reduced Taxa File Location(s)' button and a select file dialog window will open. For the purpose of this tutorial, navigate to the H-CombineReduced folder and select one of the files in this folder. The second element that needs to be selected is the TRUE or FALSE value for combining with read counts or reducing results to presence-absence data (1/0). One thing to note is that if the number of reads is maintained, these data cannot be directly compared across molecular markers.
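A minimal sketch of the presence-absence reduction applied when FALSE is selected (the file name and the position of the per-sample read-count columns are hypothetical):

```r
# Convert per-sample read counts to presence-absence (1/0)
reduced <- read.delim("combined_taxaReduced.tsv")  # hypothetical combined file
sampleCols <- 15:ncol(reduced)                     # assumed sample-column positions
reduced[sampleCols] <- lapply(reduced[sampleCols], function(x) as.integer(x > 0))
```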
The final element of the DBTCShiny program is a mapping section where users can evaluate the veracity of taxonomic assignments against collection location and collection related data. To use this function, select the 'Mapping Dashboard' tab on the left side panel of DBTCShiny.
Once selected, the mapping dashboard will appear with four possible tabs.
The data import tab
Young, R. G., Milián‐García, Y., Yu, J., Bullas‐Appleton, E., & Hanner, R. H. (2021). Biosurveillance for invasive insect pest species using an environmental DNA metabarcoding approach and a high salt trap collection fluid. Ecology and Evolution, 11(4), 1558-1569.