Installation

In order to use TransFlow you must:

    1. Clone this repository. Code: git clone https://github.com/seoanezonjic/TransFlow
    2. Install Ruby. We recommend using the RVM manager: RVM
    3. Install AutoFlow. Code: gem install autoflow
    4. Install HTML reporting. Code: gem install report_html
    5. If you want to use the assembly modules (1, 2, 3), you must install:
    6. The reference and validation modules need:
      1. Full-Lengther Next. Code: gem install full_lengther_next
      2. BUSCO, with the lineage files downloaded to the BUSCO_DB/ folder
      3. R and the following R packages:
      install.packages(c('optparse', 'FactoMineR', 'corrplot', 'factoextra', 'fmsb', 'PerformanceAnalytics'))
    7. Add all the installed software to your PATH.
    8. Add the scripts folder to your PATH.
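
The two PATH steps can be sketched in shell as follows. This is a minimal sketch that assumes the repository was cloned to $HOME/TransFlow (a hypothetical location) and that the scripts folder sits directly inside it; add_to_path is an illustrative helper, not part of TransFlow.

```shell
# Prepend a directory to PATH only if it is not already listed.
# add_to_path is an illustrative helper, not part of TransFlow.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

# Assumed clone location; adjust it to where you cloned the repository.
TRANSFLOW_HOME="$HOME/TransFlow"
add_to_path "$TRANSFLOW_HOME/scripts"
export PATH
```

Placing these lines in your shell start-up file keeps the setting across sessions. The same helper can be called once for each installed tool that is not already on PATH.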

Usage

Configuration

The batch file launch_TransFlow.sh can be edited to run custom executions (the samples folder contains the sh files used with the paper data). You can edit the following variables:

 TEMPLATES # modules of TransFlow that will be used.
 reference # folder with the fasta files of the chosen transcriptome references.
 reads  # folder with the fastq files that will be mapped against the transcriptome references
 read_454 # path to the 454/Roche reads that must be used in the assembly process
 ill_type # whether the Illumina reads are paired or single (the templates test for the values "paired" and "single")
 read_illumina_pair_1 - read_illumina_pair_2 / single_illumina # path to Illumina paired/single files, respectively
 BUSCO_DB # specific lineage to use in the BUSCO analysis
 FLN_DB # database to use in Full-Lengther Next executions
 kmers # k-mers used by the assemblers. Must be specified as "[kmer1;kmer2;kmer3;..;kmerN]"
 key_organisms # identifiers in the assembly summary table to use as reference transcriptomes. Must be specified as "ref1_ref2_ref3_.._refN"
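
The list formats required by kmers and key_organisms can be built from plain lists, as in the following sketch. The helper join_by is illustrative and not part of TransFlow; the k-mer values and reference IDs are example values only.

```shell
# Join all arguments after the first, using the first argument as separator.
# join_by is an illustrative helper, not part of TransFlow.
join_by() {
    sep="$1"
    shift
    out=""
    first=1
    for item in "$@"; do
        if [ "$first" -eq 1 ]; then
            out="$item"
            first=0
        else
            out="$out$sep$item"
        fi
    done
    printf '%s' "$out"
}

# Example values only; choose k-mers and reference IDs that fit your data.
kmers="[$(join_by ';' 25 35 45)]"
key_organisms="$(join_by '_' ref1 ref2 ref3)"

echo "$kmers"          # [25;35;45]
echo "$key_organisms"  # ref1_ref2_ref3
```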

These variables are preconfigured, so you only have to put your files in transcriptome_references and assembly_reads and then replace the uppercase placeholders in launch_TransFlow.sh with your file names. If you prefer, you can set these variables to full paths instead. When you change the reference files and their respective read files, you must edit the reference template (in the Templates folder) to state which fasta file is used, which ID the reference must have, and which read files must be used in the analysis. The fasta file is specified in the following way:

reference_arab){
       ln -s $reference/Athaliana_167_TAIR10.transcript all_sequences.fasta
       ?
       echo -e "A.thaliana\tref_p\tref_t\tref_k\tref_t" > tracker
}

Athaliana_167_TAIR10.transcript is the fasta file used as reference and A.thaliana is the ID under which this transcriptome is listed in the results. If you add a new reference, be careful to change the task name reference_arab) to a unique name such as reference_new_transcriptome). To set the read files you must change this code:

FLN_metric_Arab){
    #resources: -s -u 3 -n cal -c 48 -t '7-00:00:00' -m '20gb'
    ln -s reference_arab)/tracker
    ?
    full_lengther_next -s 10.243 -f reference_arab)/all_sequences.fasta -a 's' -z  -g $FLN_DATABASE -c 500 -q 'd' -w [lcpu] -M '$reads/arab_ill_1.fastq,$reads/arab_ill_2.fastq'
}

In this case, you must change the file names arab_ill_1.fastq and arab_ill_2.fastq to the paired files that you want to use.
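
A new reference node of the form shown above can be generated with a small helper, sketched below. make_reference_node is illustrative and not part of TransFlow, and the tracker fields after the ID are copied verbatim from the reference_arab) example. The matching FLN_metric node still has to be written by hand with your own read files.

```shell
# Print a reference node for the reference template.
# make_reference_node is an illustrative helper, not part of TransFlow.
#   $1 = unique task suffix, $2 = fasta file name, $3 = ID shown in the results
make_reference_node() {
    printf 'reference_%s){\n' "$1"
    printf '       ln -s $reference/%s all_sequences.fasta\n' "$2"
    printf '       ?\n'
    printf '       echo -e "%s\\tref_p\\tref_t\\tref_k\\tref_t" > tracker\n' "$3"
    printf '}\n'
}
```

For example, make_reference_node new_transcriptome my_reference.fasta My.species prints a node named reference_new_transcriptome) that links my_reference.fasta; all three arguments here are hypothetical names.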

Execution

launch_TransFlow.sh accepts two values. Running ./launch_TransFlow.sh 1 launches the whole workflow. Running ./launch_TransFlow.sh 2 shows a control log so that you can inspect the workflow progress.

Add new assemblers

To add a new assembler to the assembly modules, you must follow a few conventions. The following example is the Ray assembler used in the Illumina module:

RAY_primary_assembling_$kmers){
        echo -e "ctRayK(*)\tRay\tprimary\t(*)\tIll" > tracker
        if [ "$ill_type" == "paired" ]; then
                input_files='-p Cleanup_ill)/output_files/paired_1_.fastq Cleanup_ill)/output_files/paired_2_.fastq'
        elif [ "$ill_type" == "single" ]; then
                input_files='-s Cleanup_ill)/output_files/sequences_.fastq'
        else
                echo 'Read input files not set.'
                exit 1
        fi
        ?
        mpiexec -np [cpu] Ray -k (*) $input_files 
        if [ ! -s RayOutput/Contigs.fasta ] || [ ! -s RayOutput/Scaffolds.fasta ]; then
                echo "ERROR: Ray primary assembly or scaffolding has failed"
                exit 1 # Fail
        fi

        ln -s RayOutput/Contigs.fasta all_sequences.fasta
}

Conventions:

  • You must add a node with the name ASSEMBLER_primary_assembling_$kmers)

  • The next line must describe the assembly as echo -e "ctASSEMBLY_NAME_K(*)\tPROGRAM_NAME\tprimary\t(*)\tIll" > tracker

    • ctASSEMBLY_NAME_K(*) is the assembly identifier and must be filled with the desired name. The (*) expression is replaced automatically by the specified k-mer value.
    • PROGRAM_NAME must be the name of the software used. This parameter is used as a factor in the PCA.
    • primary indicates that the assembly is performed with pre-processed reads. This parameter is used as a factor in the PCA.
    • Ill indicates that the reads come from the Illumina platform.
  • A conditional on the $ill_type variable must check which files are to be used, and the assembler parameters must be configured taking this information into account:

    • If paired files are used, the paths to them are (this convention must be followed literally):
      • Pair 1 => Cleanup_ill)/output_files/paired_1_.fastq
      • Pair 2 => Cleanup_ill)/output_files/paired_2_.fastq
    • If there is a single read file:
      • Single file => Cleanup_ill)/output_files/sequences_.fastq
  • The assembly command must receive a variable with the execution configuration ($input_files in the example), and the k-mer parameter must receive the (*) expression as its argument.

  • Another conditional must check whether the assembler execution was successful. If not, the execution must be aborted with exit 1.

  • The contig file generated by the assembler must be symlinked as follows (otherwise it is ignored):

    ln -s RELATIVE_PATH_TO/CONTIG_FILE all_sequences.fasta

  • If the assembler produces two files, one for contigs and another for scaffolds/post-processed contigs, a new node must be created:

scaffolding_ASSEMBLER_$kmers){
        echo -e "scASSEMBLY_NAME_K(*)\tPROGRAM_NAME\tscaffolding\t(*)\tIll" > tracker
        ?
        ln -s !ASSEMBLER_primary_assembling_*!/RELATIVE_PATH_TO/SCAFFOLD_FILE all_sequences.fasta
}
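
The conventions above can be checked roughly before launching the workflow. The following sketch assumes that the new primary assembly node has been saved on its own in a file; check_assembler_node is an illustrative helper, not part of TransFlow or AutoFlow, and it only looks for the expected text patterns without validating the node itself.

```shell
# Check that a file holding one primary assembly node follows the conventions.
# check_assembler_node is an illustrative helper, not part of TransFlow or AutoFlow.
check_assembler_node() {
    node_file="$1"
    status=0
    grep -q '_primary_assembling_\$kmers){' "$node_file" \
        || { echo "missing ASSEMBLER_primary_assembling_\$kmers) node name"; status=1; }
    grep -q '> tracker' "$node_file" \
        || { echo "missing tracker line"; status=1; }
    grep -q '\$ill_type' "$node_file" \
        || { echo "missing \$ill_type conditional"; status=1; }
    grep -qF '(*)' "$node_file" \
        || { echo "missing (*) k-mer placeholder"; status=1; }
    grep -q 'exit 1' "$node_file" \
        || { echo "missing failure check with exit 1"; status=1; }
    grep -q 'ln -s .* all_sequences\.fasta' "$node_file" \
        || { echo "missing all_sequences.fasta symlink"; status=1; }
    return "$status"
}
```

The function prints one message per missing convention and returns a non-zero status if any is missing, so it can be used as a guard in a launch script.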

Citation